Springer Series in the Data Sciences 


Ovidiu Calin 


Deep Learning 
Architectures 

A Mathematical Approach 



Springer 



Springer Series in the Data Sciences 


Series Editors 

David Banks, Duke University, Durham, NC, USA 

Jianqing Fan, Department of Financial Engineering, Princeton University, 

Princeton, NJ, USA 

Michael Jordan, University of California, Berkeley, CA, USA 

Ravi Kannan, Microsoft Research Labs, Bangalore, India 

Yurii Nesterov, CORE, Université Catholique de Louvain, Louvain-la-Neuve, 

Belgium 

Christopher Ré, Department of Computer Science, Stanford University, Stanford, 
USA 

Ryan J. Tibshirani, Department of Statistics, Carnegie Mellon University, 
Pittsburgh, PA, USA 

Larry Wasserman, Department of Statistics, Carnegie Mellon University, 
Pittsburgh, PA, USA 



Springer Series in the Data Sciences focuses primarily on monographs and graduate 
level textbooks. The target audience includes students and researchers working in 
and across the fields of mathematics, theoretical computer science, and statistics. 
Data Analysis and Interpretation is a broad field encompassing some of the 
fastest-growing subjects in interdisciplinary statistics, mathematics and computer 
science. It encompasses a process of inspecting, cleaning, transforming, and 
modeling data with the goal of discovering useful information, suggesting 
conclusions, and supporting decision making. Data analysis has multiple facets 
and approaches, including diverse techniques under a variety of names, in different 
business, science, and social science domains. Springer Series in the Data Sciences 
addresses the needs of a broad spectrum of scientists and students who are utilizing 
quantitative methods in their daily research. The series is broad but structured, 
including topics within all core areas of the data sciences. The breadth of the series 
reflects the variation of scholarly projects currently underway in the field of 
machine learning. 


More information about this series at http://www.springer.com/series/13852 


Ovidiu Calin 

Department of Mathematics & Statistics 
Eastern Michigan University 
Ypsilanti, MI, USA 


ISSN 2365-5674 ISSN 2365-5682 (electronic) 

Springer Series in the Data Sciences 

ISBN 978-3-030-36720-6 ISBN 978-3-030-36721-3 (eBook) 

https://doi.org/10.1007/978-3-030-36721-3 

Mathematics Subject Classification (2010): 68T05, 68T10, 68T15, 68T30, 68T45, 68T99 
© Springer Nature Switzerland AG 2020 

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part 
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, 
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission 
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar 
methodology now known or hereafter developed. 

The use of general descriptive names, registered names, trademarks, service marks, etc. in this 
publication does not imply, even in the absence of a specific statement, that such names are exempt from 
the relevant protective laws and regulations and therefore free for general use. 

The publisher, the authors and the editors are safe to assume that the advice and information in this 
book are believed to be true and accurate at the date of publication. Neither the publisher nor the 
authors or the editors give a warranty, expressed or implied, with respect to the material contained 
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard 
to jurisdictional claims in published maps and institutional affiliations. 

This Springer imprint is published by the registered company Springer Nature Switzerland AG 
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland 


Foreword 


The multiple commercial applications of neural networks have been highly 
profitable. Neural networks are embedded in novel technologies, which are 
now used successfully by top companies such as Google, Microsoft, Facebook, 
IBM, Apple, Adobe, Netflix, NVIDIA, and Baidu. 

A neural network is a collection of interconnected computing units, called 
neurons, each producing a real-valued outcome, called its activation. Input 
neurons are activated by the sensors that perceive the environment, while the 
other neurons are activated by the activations of previous neurons. This 
structure allows neurons to send messages among themselves and, consequently, 
to strengthen those connections that lead to success in solving a problem and 
to diminish those that lead to failure. 

This book describes how neural networks operate from the mathematical 
perspective, keeping in mind that the success of neural network methods 
should be determined not by trial-and-error or luck, but by a clear 
mathematical analysis. The main goal of the present work is to translate the 
ideas and concepts of neural networks, which are used nowadays at an intuitive 
level, into a precise modern mathematical language. The book is a mixture of 
good old classical mathematics and modern concepts of deep learning. The main 
focus is on the mathematical side, since in today’s developing trend many 
mathematical aspects are passed over in silence and most papers emphasize only 
the computer science details and practical applications. 

Keeping the emphasis on the mathematics behind the scenes of neural 
networks, the book also makes connections with Lagrangian Mechanics, 
Functional Analysis, Calculus, Differential Geometry, Equations, Group 
Theory, and Physics. 

The book is meant to be neither exhaustive in its description of methods and 
neural network types nor a cookbook of machine learning algorithms. 
Instead, it concentrates on the basic mathematical principles that govern 
this branch of science, providing a clear theoretical motivation for why 
neural nets work so well in so many applications and giving students the 
tools needed for an analytical understanding of this field. Nor is this a 
programming book, since it contains no code. Most computer software for 
neural nets quickly becomes obsolete in this age of technological revolution. 
The book treats the mathematical concepts that lie behind neural network 
algorithms, without explicitly implementing code in any specific programming 
language. 

Readers interested in the practical aspects of neural networks, including the 
programming point of view, are referred to several recent books on the subject, 
which implement machine learning algorithms in different programming 
languages and frameworks, such as TensorFlow, Python, or R. A few examples of 
such books are: Deep Learning with Python by F. Chollet, Machine Learning with 
Python Cookbook: Practical Solutions from Preprocessing to Deep Learning 
by C. Albon, Python Machine Learning: Machine Learning and Deep Learning 
with Python, Scikit-learn, and TensorFlow by S. Raschka and V. Mirjalili, 
Introduction to Deep Learning Using R: A Step-by-Step Guide to Learning and 
Implementing Deep Learning Models Using R by T. Beysolow, Deep Learning: 
A Practitioner’s Approach by J. Patterson and A. Gibson, Hands-On Machine 
Learning with Scikit-Learn and TensorFlow: Concepts, Tools and Techniques 
to Build Intelligent Systems by A. Géron, and Deep Learning for Beginners: A 
Practical Guide with Python and TensorFlow by F. Duval. 

The target audience for this text is graduate students in mathematics, 
electrical engineering, statistics, theoretical computer science, and 
econometrics who would like to gain a theoretical understanding of neural 
networks. To this end, the book is more useful for researchers than for 
practitioners. For the sake of completeness, we have included in the 
appendix relevant material on measure theory, probability, and linear 
algebra, as well as real and functional analysis. Depending on their background, 
different readers will find different parts of the appendix useful. For instance, 
most graduate students in mathematics, who typically take basic functional 
analysis and real analysis, will find most of the book accessible, while 
statistics students, who take a minimal amount of measure theory and 
probability, will find it useful to read the analysis part of the appendix. Finally, 
computer science students, who usually do not have a strong mathematical 
background, will find most of the appendix useful. The minimal prerequisites 
for reading this book are: Calculus I, Linear Algebra with vectors and matrices, 
and elements of Probability Theory. 

However, we recommend this book to students who have already had an 
introductory course in machine learning and are interested in deepening 
their understanding of the machine learning material from the mathematical 
point of view. The book can be used for a one-semester or a two-semester 
graduate course, covering Parts I-II and I-V, respectively. 

The book contains a fair number of exercises, mostly solved or containing 
hints. The proposed exercises range from very simple to rather difficult, many 
of them being some applications or extensions of the presented results. The 
references to the literature are not exhaustive. Anything even approaching 
completeness was out of the question. 

Despite the large number of neural network books published in the last few 
years, very few of them are of a theoretical nature and discuss the 
“mathematical concepts” underlying neural networks. One of the well-known 
books in the field, providing a comprehensive review of the deep learning 
literature, is Goodfellow et al. [46], which can be used both as a textbook and 
as a reference. Another book of great inspiration is the online book of Nielsen [92]. 
Also, the older book of Rojas [101] presents a classical introduction to neural 
networks. A notable book containing a discussion of statistical machine 
learning algorithms such as “artificial neural networks” is Hastie et al. [53]. 
The mathematical concepts underlying neural networks are also discussed in 
the books of Mohri et al. [88] and Shalev-Shwartz et al. [110]. 

Even if the previous books cover important aspects related to statistical 
learning and mathematical statistics of deep learning, or the mathematics 
relevant to the computational complexity of deep learning, there is still a 
niche in the literature, which this book attempts to address. This book 
attempts to provide a useful introductory discussion of what types of 
functions can be represented by deep learning neural networks. 

Overview 

The book is structured into five main parts, from simple to complex topics. 
The first one is an accessible introduction to the theory of neural networks; 
the second part treats neural networks as universal approximators; the third 
regards neural nets as information processors; the fourth applies elements of 
geometric theory to neural networks. The last part deals with the 
theory behind several topics of interest such as pooling, convolution, CNNs, 
RNNs, GANs, Boltzmann machines, and classification. For the sake of 
completeness, the book contains an Appendix with topics collected from 
Measure Theory, Probability Theory, Linear Algebra, and Functional 
Analysis. Each chapter ends with a summary and a set of end-of-chapter 
exercises. Full solutions or just hints for the proposed exercises are provided 
at the end of the book; a comprehensive index is also included. 

Part I 

This is the most elementary part of the book. It contains classical topics 
such as activation functions, cost functions, and learning algorithms in neural 
networks. 

Chapter 1 introduces a few daily life problems, which lead toward the 
concept of an abstract neuron. These problems introduce the reader to the process 
of adjusting a rate, a flow, or a current that feeds a tank, cell, fund, transistor, 
etc., which triggers a certain activation function. The optimization of the 
process involves the minimization of a cost function, such as volume, energy, 
potential, etc. Several neural units can work together as a neural network. As 
an example, we provide the relation between linear regression and neural 
networks. 

In order to learn a nonlinear target function, a neural network needs to use 
activation functions that are nonlinear. The choice of these activation 
functions defines different types of networks. Chapter 2 contains a systematic 
presentation of the zoo of activation functions that can be found in the 
literature. They are classified into three main classes: sigmoid type (logistic, 
hyperbolic tangent, softsign, arctangent), hockey-stick type (ReLU, PReLU, 
ELU, SELU), and bumped type (Gaussian, double exponential). 

During the learning process a neural network has to adjust its parameters 
such that a certain objective function gets minimized. This function is also 
known under the names of cost function, error function, or loss function. 
Chapter 3 describes some of the most familiar cost functions used in neural 
networks. They include the following: the supremum error function, $L^2$-error 
function, mean square error function, cross-entropy, Kullback-Leibler 
divergence, Hellinger distance, and others. Some of these cost functions are 
suited for learning random variables, while others are suited for learning 
deterministic functions. 

Chapter 4 presents a series of classical minimization algorithms. They are 
needed for the minimization of the associated cost function. The most used is 
the Gradient Descent Algorithm, which is presented in full detail. Other 
algorithms contained in the chapter are the line search method, momentum 
method, simulated annealing, AdaGrad, Adam, AdaMax, Hessian, and 
Newton’s methods. 
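
As a brief illustration of the gradient descent idea mentioned above, the standard update rule can be sketched with the notation from the Notations and Symbols list ($C$ for the cost function, $\eta$ for a learning rate, $\nabla C$ for the gradient). The iteration index $k$ and the combined parameter vector $\theta = (w, b)$ are labels assumed here for illustration only; the formulation given in Chapter 4 may differ in detail.

```latex
% Sketch of one gradient descent step (assumed notation: \theta_k = (w_k, b_k))
\theta_{k+1} = \theta_k - \eta \, \nabla C(\theta_k), \qquad \eta > 0 .
% For a sufficiently small step \eta, a first-order Taylor expansion gives
C(\theta_{k+1}) \approx C(\theta_k) - \eta \, \|\nabla C(\theta_k)\|^2 \le C(\theta_k) .
```

The second line indicates why the step is taken against the gradient: to first order, the cost cannot increase.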

Chapter 5 introduces the concept of abstract neuron and presents a few 
classical types of neurons, such as: the perceptron, sigmoid neuron, linear 
neuron, and the neuron with a continuum input. Also, some applications to 
logistic regression and classification are included. 

The study of networks of neurons is done in Chap. 6. The architecture of a 
network as well as the backpropagation method used for training the network 
are explained in detail. 

Part II 

The main idea of this part is that neural networks are universal 
approximators, that is, the output of a neural network can approximate a large 
number of types of targets, such as continuous, square integrable, or 
integrable functions, as well as measurable functions. 

Chapter 7 introduces the reader to a number of classical approximation 
theorems of analytic flavor. This powerful toolbox, with applications to learning, 
contains Dini’s Theorem, the Arzelà-Ascoli Theorem, the Stone-Weierstrass 
Theorem, Wiener’s Tauberian Theorems, and the Contraction Principle. 

Chapter 8 deals with the case when the input variable is bounded and 
1-dimensional. Besides its simplicity, this case provides an elementary 
treatment of learning and has a constructive nature. Both cases of 
multi-perceptron and sigmoid neural networks with one hidden layer are 
covered. Two sections are also dedicated to learning with ReLU and Softplus 
functions. 

Chapter 9 answers the question of what kind of functions can be learned 
by one-hidden layer neural networks. It is based on the approximation theory 
results developed by Funahashi, Hornik, Stinchcombe, White, and Cybenko 
in the late 1980s and early 1990s. The chapter provides mathematical proofs of 
analytic flavor that one-hidden layer neural networks can learn continuous, 
$L^1$- and $L^2$-integrable functions, as well as measurable functions on a 
compact set. 

Chapter 10 deals with the case of exact learning, namely, with the case of 
a network that can reproduce exactly the desired target function. The 
chapter contains results regarding the exact learning of finite support 
functions, max functions, and piecewise linear functions. It also contains 
the Kolmogorov-Arnold-Sprecher Theorem, Irie and Miyake’s integral formula, 
as well as exact learning using integral kernels in the case of a continuum 
number of neurons. 


Part III 

The idea of this part is that neural networks can be regarded as information 
processors. The input to a neural net contains information, which is 
processed by the multiple hidden layers of the network, until certain 
characteristic features are selected. The output layer contains the compressed 
information needed for certain classification purposes. The job of a neural 
network can be compared with the one done by an artist in the process of 
making a sculpture. In the first stage the artist prepares the marble block, 
and then he cuts rough shapes. Then he continues with sketching general 
features and ends with refined details. Like the artist, who removes 
the right amount of material at each step, a neural network discards at each 
layer a certain amount of irrelevant information, such that in the end only 
the desired features are left. 

Chapter 11 models the information processed by a neural net from the 
point of view of sigma-fields. There is some information lost in each layer of a 
feedforward neural net. Necessary conditions for the layers with trivial lost 
information are provided, and an information description of several types of 
neurons is given. For instance, a perceptron learns the information provided 
by half-planes, while a multi-perceptron is able to learn more sophisticated 
structures. Compressible layers are also studied from the point of view of 
information compression. 

Chapter 12 is a continuation of the previous one. It deals with a 
quantitative assessment of how the flow of information propagates through the 
layers of a neural network. The tools used for this purpose are entropy, 
conditional entropy, and mutual information. As an application, we present a 
quantitative approach to the network’s capacity and the information bottleneck. 
Some applications of information processing with the MNIST data set 
are provided at the end of the chapter. 




Part IV 

This part contains elements of geometric theory applied to neural 
networks. Each neural network is visualized as a manifold endowed with a 
metric, which can be induced from the target space or can be the Fisher 
information metric. 

Since the output of any feedforward neural net can be parametrized by its 
weights and biases, in Chap. 13 we shall consider them as coordinate systems 
for the network. Consequently, a manifold is associated with each given 
network, and this can be endowed with a Riemannian metric, which describes 
the intrinsic geometry of the network. Each learning algorithm, which 
involves a change of parameters with respect to time, corresponds to a curve 
on this manifold. The most efficient learning corresponds to the shortest 
curve, which is a geodesic. The embedded curvature of this manifold into the 
target space can be used for regularization purposes, namely, the flatter the 
manifold, the less overfitting to the training data. 

Chapter 14 is similar to the previous one, the difference being that 
neurons are allowed to be noisy. In this case the neural manifold is a manifold 
of probability densities and the intrinsic distance between two networks is 
measured using the Fisher information metric. The variance of the 
estimation of optimal density parameters has a lower bound given by the 
inverse of the Fisher metric. The estimators which reach this lower bound are 
called Fisher-efficient. The parameters estimated by online learning or by 
batch learning are efficient in an asymptotic way. 

Part V 

This part deals with a few distinguished neural network architectures, 
such as CNNs, RNNs, GANs, Boltzmann machines, etc. 

Pooling and convolution are two machine learning procedures described in 
Chaps. 15 and 16, respectively. Pooling compresses information with 
irremediable loss, while convolution compresses information by smoothing using 
a convolution with a kernel. The equivariance theory is also presented in 
terms of convolution layers. 

Chapter 17 deals with recurrent neural networks, which are specialized in 
processing sequential data such as audio and video. We describe their 
training by backpropagation through time and discuss the vanishing gradient 
and exploding gradient problems. 

Chapter 18 deals with the process of classification of a neural network. It 
includes the linear and nonlinear separability of clusters, and it studies 
decision maps and their learning properties. 

Chapter 19 treats generative networks, among which GANs play a central 
role. The roles of discriminator and generator networks are described, and 
their optimality is discussed in the common setup of a zero-sum game, or a 
mini-max problem. 




Chapter 20 contains a presentation of stochastic neurons, Boltzmann 
machines, and Hopfield networks as well as their applications. We also 
present the equilibrium distribution of a Boltzmann machine, its entropy, and 
the associated Fisher information metric. 

Bibliographical Remarks 

In the following we shall make some bibliographical remarks that will place 
the subject of this book in a proper historical perspective. The presentation is 
far from exhaustive; the reader interested in details is referred to 
the paper of Jürgen Schmidhuber [109], which presents a chronological review 
of deep learning and contains plenty of good references. 

First and foremost, neural networks, as variants of linear or nonlinear 
regression, have been around for about two centuries, being rooted in the 
work of Gauss and Legendre. The concept of estimating parameters using 
basis functions, as in Fourier analysis methods, can also be viewed as a 
basic mechanism of multilayer perceptrons operating in a time domain, which 
has been around for a while. 

Prior to the 1980s all neural network architectures were “shallow”, i.e., they 
had just very few layers. The first attempt to model a biological neuron was 
made by Warren S. McCulloch and Walter Pitts in 1943 [82]. Since this model 
had serious learning limitations, Frank Rosenblatt introduced the multilayer 
perceptron in 1959, endowed with better learning capabilities. Rosenblatt’s 
perceptron had an input layer of sensory units, hidden units called association 
units, and output units called response units. In fact, the perceptron was 
intended to be a pattern recognition device, and the association units 
correspond to feature or pattern detectors. The theory was published in his 1961 
book, “Principles of neurodynamics: Perceptrons and the theory of brain 
mechanism”, but it suffered from a lack of appreciation at the time due to 
its limitations, as pointed out in the book of Minsky and Papert [87]. 

Learning in Rosenblatt’s multilayer perceptron is guaranteed by a 
convergence theorem, which assures learning in finite time. However, in most 
cases this convergence is too slow, which represents a serious limitation of the 
method. This is the reason why the subsequent learning algorithms involved a 
procedure introduced by Hadamard as early as 1908, called the gradient 
descent method. Shun-ichi Amari in his 1967 paper [3] describes the use of 
gradient descent methods for adaptive learning machines. The 
gradient descent method was also employed early on by Kelley [62], Bryson [18], 
Dreyfus [33], etc., in the same time period. The previously described period, 
lasting from 1940 to 1960 and dealing with theories inspired by 
biological learning, is called cybernetics. 

However, the first artificial neural network that was indeed “deep” was 
Fukushima’s “neocognitron” (1979), which was based on the principle of 
convolutional neural networks and inspired by the mammalian visual cortex. 
The difference from the later contest-winning modern architectures was that 
the weights were not found by a backpropagation algorithm, but by using a 
local, winner-take-all based learning algorithm. The backpropagation 
algorithm was first used, without any reference to neural networks, by 
Linnainmaa in the 1970s [77]. The first reference on backpropagation specifically 
related to neural networks is contained in Werbos [123, 124]. 

The next period, from 1990 to 1995, is called connectionism and is 
characterized by training deep neural networks using backpropagation. The 
method of backpropagation was first described in the mid-1980s by Parker [93], 
LeCun [73], and Rumelhart, Hinton, and Williams [108], and it is still the 
most important procedure to train neural networks. 

In the late 1980s several representation theorems dealing with neural networks 
as universal approximators were found, see Hornik et al. [89] and Cybenko 
[30]. The results state that a feedforward neural network with one hidden 
layer and a linear output layer can approximate to any accuracy any 
continuous, integrable, or measurable function on a compact set, provided there 
are enough units in the hidden layer. Other theoretically comforting results 
state the existence of an exact representation of continuous functions, see 
Sprecher [115], Kolmogorov [64], and Irie [60]. One of the main points of the 
present book is to present this type of representation theorem. 
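
The approximation statement above can be written out, as a sketch, in the form usually attributed to Cybenko for continuous targets. The symbols $N$ (number of hidden units), $\alpha_j$, $w_j$, $\theta_j$ (output weights, hidden weights, and biases), $G$, and $\varepsilon$ are notation assumed here for illustration; $\sigma$ denotes a sigmoidal activation and $I_n$ the $n$-dimensional hypercube, as in the Notations and Symbols list.

```latex
% For every continuous f on I_n and every \varepsilon > 0 there exist an integer N
% and parameters \alpha_j, \theta_j \in \mathbb{R}, w_j \in \mathbb{R}^n such that
G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left( w_j^{T} x + \theta_j \right),
\qquad
\sup_{x \in I_n} \left| f(x) - G(x) \right| < \varepsilon .
```

For integrable or measurable targets, the supremum is replaced by a notion of closeness suited to that class of functions.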

During the 1990s important advances took place in modeling sequences 
with recurrent neural networks, with later application to natural language 
processing. These can be mostly attributed to the introduction of the long 
short-term memory (LSTM) by Hochreiter and Schmidhuber [56] in 1997. 

Other novel advances occurred in computer vision, being inspired by the 
mammalian visual cortex mechanism, and can be credited to the use of 
convolutional neural networks, see LeCun et al. [74, 75]. This marked the 
beginning of the deep learning period, which we briefly present below. 

The pioneering work in deep learning started with LeNet (1998), a 7-level 
convolutional network, which was used for handwritten digit classification and 
was applied by several banks to recognize handwritten numbers on checks. After 
this the models became deeper and more complex, with an enhanced ability 
to process higher resolution images. 

The first famous deep network in this direction was AlexNet (2012), which 
competed in the ImageNet Large Scale Visual Recognition Challenge 
(ILSVRC), significantly outperforming all the prior competitors and 
achieving an error of 15.3%. Just a year later, in 2013, the ILSVRC winner 
network was also a convolutional network, called ZFNet, which achieved an 
error rate of 14.8%. This network maintained a structure similar to that of 
AlexNet and tweaked its hyperparameters. 

The winner of the ILSVRC 2014 competition was GoogLeNet from Google, 
a 22-layer deep network using a novel inception module, which achieved an 
error rate of 6.67%, quite close to human-level performance. 
Another 2014 ILSVRC top 5 competition winner was a network with 16 
convolutional layers, which is known as VGGNet (named after the Visual 
Geometry Group at the University of Oxford). It is similar to AlexNet but has 
some new standards (all its filters have size 3×3, max-poolings are placed 
after every 2 convolutions, and the number of filters is doubled after each 
max-pooling). Nowadays it is the most preferred choice for extracting 
features from images. 

ResNet (Residual Neural Network) was introduced in 2015. Its novel 
152-layer architecture has a strong similarity with RNNs and a novel feature 
of skipping connections, while using a lower complexity than VGGNet. It 
achieved an error rate of 3.57%, which beats human-level performance on 
that data set, which is about 5.1%. 

In conclusion, the idea of teaching computers how to learn a certain task 
from raw data is not new. Even if the concepts of neuron, neural network, 
learning algorithm, etc. were created decades ago, it was only around 2012 
that Deep Learning became extremely fashionable. It is not a coincidence 
that this occurred during the age of Big Data, because these models are very 
hungry for data, and this requirement could be satisfied only recently when 
GPU technology was developed enough. 

Besides the presentation, this book contains a few more novel features. For 
instance, the increasing resolution method in Sect. 4.10, stochastic search in 
Sect. 4.13, continuum input neuron in Sect. 5.8, continuum deep networks in 
Sects. 10.6 and 10.9, and information representation in a neural network as a 
sigma-algebra in Part III, to enumerate only a few, are all original 
contributions. 

Acknowledgements 

The work on this book started while the author was a Visiting 
Professor at Princeton University, during the 2016-2017 academic year, and 
then continued during 2018-2019, while the author was supported by a 
sabbatical leave from Eastern Michigan University. I hereby express 
my gratitude to both Princeton University, where I was first exposed to this 
fascinating subject, and Eastern Michigan University, whose excellent 
research conditions are truly appreciated. Several chapters of the book were 
presented during the Machine Learning of Ann Arbor Group meetings, 
during 2017-2018. 

This book owes much of its clarity and quality to the many useful 
comments of several anonymous reviewers, whose time and patience in dealing with 
the manuscript are very much appreciated. Finally, I wish to express many 
thanks to Springer and its editors, especially Donna Chernyk, for making this 
endeavor a reality. 


Ann Arbor, Michigan, USA 
September 2019 



Chapters Diagram 




Notations and Symbols 


The following notations have been frequently used in the text. 

Calculus 

$H_f$                                 Hessian of $f$ 
$J_F(x)$                              Jacobian matrix of $F : \mathbb{R}^n \to \mathbb{R}^n$ 
$\frac{\partial}{\partial x_k}$       Partial derivative with respect to $x_k$ 
$L^2[0,T]$                            Squared integrable functions on $[0,T]$ 
$C(I_n)$                              Continuous functions on $I_n$ 
$C^2(\mathbb{R}^n)$                   Functions twice differentiable with second derivative continuous 
$C_0^2(\mathbb{R}^n)$                 Functions with compact support of class $C^2$ 
$\|x\|$                               Euclidean norm ($= \sqrt{x_1^2 + \cdots + x_n^2}$) 
$\|f\|_{L^2}$                         The $L^2$-norm ($= \sqrt{\int f(t)^2 \, dt}$) 
$\nabla C$                            Gradient of the function $C$ 

Linear Algebra 

$e_j$                                 Vector $(0, \ldots, 0, 1, 0, \ldots, 0)$ 
$\mathbb{R}^n$                        $n$-dimensional Euclidean space 
$\det A$                              Determinant of matrix $A$ 
$A^{-1}$                              Inverse of matrix $A$ 
$\|A\|$                               Norm of matrix $A$ 
$\mathbb{I}$                          Unitary matrix 
$A * B$                               Convolution of matrices $A$ and $B$ 
$a \odot b$                           Hadamard product of vectors $a$ and $b$ 




Probability Theory 

$(\Omega, \mathcal{F}, P)$            Probability space 
$\Omega$                              Sample space 
$\omega$                              Element of the sample space 
$X$                                   Random variable 
$\mathfrak{S}(X)$                     Sigma-field generated by $X$ 
$X_t$                                 Stochastic process 
$\mathcal{F}_t$                       Filtration 
$W_t$                                 Brownian motion 
$E(X)$                                The mean of $X$ 
$E[X_t \mid X_0 = x]$                 Expectation of $X_t$, given $X_0 = x$ 
$E(X \mid \mathcal{F})$               Conditional expectation of $X$, given $\mathcal{F}$ 
$Var(X)$                              Variation of $X$ 
$E^{P}[\,\cdot\,]$                    Expectation operator in the distribution $P$ 
$Var(X)$                              Variance of the random variable $X$ 
$cov(X, Y)$                           Covariance of $X$ and $Y$ 
$corr(X, Y)$                          Correlation of $X$ and $Y$ 
$\mathcal{N}(\mu, \sigma^2)$          Normally distributed with mean $\mu$ and variance $\sigma^2$ 


Measure Theory 


$1_A$                                 The characteristic function of $A$ 
$\mathcal{F}$                         Sigma-field 
$\mu, \nu$                            Measures 
$I_n$                                 The $n$-dim hypercube 
$M(I_n)$                              Space of finite signed Baire measures on $I_n$ 
$\int p(x) \, d\nu(x)$                Integration of $p(x)$ in measure $\nu$ 

Information Theory 

$S(p, q)$                             Cross-entropy between $p$ and $q$ 
$D_{KL}(p \| q)$                      Kullback-Leibler divergence between $p$ and $q$ 
$H(p)$                                Shannon entropy of probability density $p$ 
$H(X)$                                Shannon entropy of random variable $X$ 
$H(\mathcal{A}, \mu)$                 Entropy of partition $\mathcal{A}$ and measure $\mu$ 
$H(X \mid Y)$                         Conditional entropy of $X$ given $Y$ 
$I(X \mid Y)$, $I(X, Y)$              Mutual information of $X$ and $Y$ 





Differential Geometry 

$\mathcal{S}$                        Neuromanifold
$\mathrm{dist}(z, \mathcal{S})$      Distance from point $z$ to $\mathcal{S}$
$U, V$                               Vector fields
$g_{ij}$                             Coefficients of the first fundamental form
$L_{ij}$                             Coefficients of the second fundamental form
$\Gamma_{ij,k}$, $\Gamma_{ij}^{k}$   Christoffel coefficients of the first and second kind
$\nabla_U V$                         Derivation of vector field $V$ in direction $U$
$\dot{c}(t)$                         Velocity vector along curve $c(t)$
$\ddot{c}(t)$                        Acceleration vector along curve $c(t)$


Neural Networks

$\varphi(x)$             Activation function
$\sigma(x)$              Logistic sigmoid function
$t(x)$                   Hyperbolic tangent
$sp(x)$                  Softplus function
$ReLU(x)$                Rectified linear units
$w, b$                   Weights and biases
$C(w, b)$                Cost function
$\delta_{x_0}$           Dirac's distribution sitting at $x_0$
$f_{w,b}$                Input-output function
$\eta, \delta$           Learning rates
$X^{(0)}$                Input layer of a net
$X^{(L)}, Y$             Output layer of a net
$X^{(\ell)}$             Output of the $\ell$th layer
$W^{(\ell)}$             Weight matrix for the $\ell$th layer
$B^{(\ell)}$             Bias vector for the $\ell$th layer
$\delta^{(\ell)}$        Backpropagated delta error in the $\ell$th layer
$\mathcal{I}^{(\ell)}$   Information generated by the $\ell$th layer
$\mathcal{L}^{(\ell)}$   Lost information in the $\ell$th layer



Contents 


Foreword v

Chapters Diagram xv 

Notations and Symbols xvii 

Part I Introduction to Neural Networks 

1 Introductory Problems . 3 

1.1 Water in a Sink. 3 

1.2 An Electronic Circuit. 5 

1.3 The Eight Rooks Problem. 7 

1.4 Biological Neuron. 9 

1.5 Linear Regression. 11 

1.6 The Cocktail Factory Network. 14 

1.7 An Electronic Network. 15 

1.8 Summary. 17 

1.9 Exercises. 18 

2 Activation Functions . 21 

2.1 Examples of Activation Functions. 21 

2.2 Sigmoidal Functions. 32 

2.3 Squashing Functions. 36 

2.4 Summary. 37 

2.5 Exercises. 38 

3 Cost Functions . 41 

3.1 Input, Output, and Target. 41 

3.2 The Supremum Error Function. 42 

3.3 The $L^2$-Error Function. 42

3.4 Mean Square Error Function. 43 



3.5 Cross-entropy . 46 

3.6 Kullback-Leibler Divergence. 48 

3.7 Jensen-Shannon Divergence. 51 

3.8 Maximum Mean Discrepancy. 51 

3.9 Other Cost Functions. 54 

3.10 Sample Estimation of Cost Functions. 55 

3.11 Cost Functions and Regularization. 58 

3.12 Training and Test Errors. 59 

3.13 Geometric Significance. 63 

3.14 Summary. 65 

3.15 Exercises. 66 

4 Finding Minima Algorithms. 69 

4.1 General Properties of Minima. 69 

4.1.1 Functions of a real variable. 69 

4.1.2 Functions of several real variables. 70 

4.2 Gradient Descent Algorithm. 73 

4.2.1 Level sets. 73 

4.2.2 Directional derivative. 81 

4.2.3 Method of Steepest Descent. 82 

4.2.4 Line Search Method. 86 

4.3 Kinematic Interpretation. 90 

4.4 Momentum Method. 93 

4.4.1 Kinematic Interpretation. 94 

4.4.2 Convergence conditions. 98 

4.5 AdaGrad. 100 

4.6 RMSProp. 101 

4.7 Adam. 103 

4.8 AdaMax. 104 

4.9 Simulated Annealing Method. 104 

4.9.1 Kinematic Approach for SA. 105 

4.9.2 Thermodynamic Interpretation for SA. 107 

4.10 Increasing Resolution Method . 109 

4.11 Hessian Method. 113 

4.12 Newton’s Method. 115 

4.13 Stochastic Search. 116 

4.13.1 Deterministic variant. 116 

4.13.2 Stochastic variant. 117 

4.14 Neighborhood Search. 121 

4.14.1 Left and Right Search. 121 

4.14.2 Circular Search . 122 

4.14.3 Stochastic Spherical Search. 124 

4.14.4 From Local to Global. 126 













































4.15 Continuous Learning. 127 

4.16 Summary. 129 

4.17 Exercises. 130 

5 Abstract Neurons. 133 

5.1 Definition and Properties. 133

5.2 Perceptron Model. 135 

5.3 The Sigmoid Neuron. 142 

5.4 Logistic Regression. 145 

5.4.1 Default probability of a company. 145 

5.4.2 Binary Classifier. 148 

5.4.3 Learning with the square difference cost function. 151

5.5 Linear Neuron. 152 

5.6 Adaline. 157 

5.7 Madaline. 158 

5.8 Continuum Input Neuron. 159 

5.9 Summary. 162 

5.10 Exercises. 163 

6 Neural Networks. 167 

6.1 An Example of Neural Network. 167 

6.1.1 Total variation and regularization. 172 

6.1.2 Backpropagation. 173 

6.2 General Neural Networks. 178 

6.2.1 Forward pass through the network. 179 

6.2.2 Going backwards through the network. 179 

6.2.3 Backpropagation of deltas. 181

6.2.4 Concluding relations. 183 

6.2.5 Matrix form. 183 

6.2.6 Gradient descent algorithm. 186 

6.2.7 Vanishing gradient problem. 186 

6.2.8 Batch training. 188 

6.2.9 Definition of FNN. 189 

6.3 Weights Initialization. 191 

6.4 Strong and Weak Priors. 195 

6.5 Summary. 196 

6.6 Exercises. 197 







































Part II Analytic Theory 

7 Approximation Theorems. 201 

7.1 Dini’s Theorem. 201 

7.2 Arzela-Ascoli’s Theorem . 203 

7.3 Application to Neural Networks. 207 

7.4 Stone-Weierstrass’ Theorem. 208 

7.5 Application to Neural Networks. 211 

7.6 Wiener’s Tauberian Theorems. 213 

7.6.1 Learning signals in $L^1(\mathbb{R})$. 213

7.6.2 Learning signals in $L^2(\mathbb{R})$. 214

7.7 Contraction Principle. 215 

7.8 Application to Recurrent Neural Nets. 217 

7.9 Resonance Networks. 222 

7.10 Summary. 223 

7.11 Exercises. 223 

8 Learning with One-dimensional Inputs. 227 

8.1 Preliminary Results. 227 

8.2 One-Hidden Layer Perceptron Network. 230 

8.3 One-Hidden Layer Sigmoid Network. 232 

8.4 Learning with ReLU Functions . 237 

8.5 Learning with Softplus. 242 

8.6 Discrete Data. 247 

8.7 Summary. 248 

8.8 Exercises. 249 

9 Universal Approximators. 251 

9.1 Introductory Examples. 251 

9.2 General Setup. 252 

9.3 Single Hidden Layer Networks. 255 

9.3.1 Learning continuous functions $f \in C(I_n)$. 255

9.3.2 Learning square integrable functions $f \in L^2(I_n)$. 261

9.3.3 Learning integrable functions $f \in L^1(I_n)$. 266

9.3.4 Learning measurable functions $f \in \mathcal{M}(\mathbb{R}^n)$. 268

9.4 Error Bounds Estimation. 277

9.5 Learning $q$-integrable functions, $f \in L^q(\mathbb{R}^n)$. 278

9.6 Learning Solutions of ODEs. 280 

9.7 Summary. 282 

9.8 Exercises. 282 







































10 Exact Learning. 285 

10.1 Learning Finite Support Functions. 285 

10.2 Learning with ReLU . 287 

10.2.1 Representation of maxima. 288 

10.2.2 Width versus Depth. 292 

10.3 Kolmogorov-Arnold-Sprecher Theorem. 296 

10.4 Irie and Miyake’s Integral Formula. 297 

10.5 Exact Learning Not Always Possible. 300 

10.6 Continuum Number of Neurons. 300 

10.7 Approximation by Degenerate Kernels. 306 

10.8 Examples of Degenerate Kernels. 308 

10.9 Deep Networks. 310 

10.10 Summary. 311 

10.11 Exercises. 312 


Part III Information Processing 


11 Information Representation . 317 

11.1 Information Content . 318 

11.2 Learnable Targets. 321 

11.3 Lost Information. 325 

11.4 Recoverable Information. 327 

11.5 Information Representation Examples. 330 

11.5.1 Information for indicator functions. 330 

11.5.2 Classical perceptron. 332 

11.5.3 Linear neuron. 333 

11.5.4 Sigmoid neuron. 333 

11.5.5 Arctangent neuron. 334

11.5.6 Neural network of classical perceptrons. 334 

11.5.7 Triangle as a union of rectangles. 336 

11.5.8 Network of sigmoid neurons. 336 

11.5.9 More remarks. 338 

11.6 Compressible Layers . 339 

11.7 Layers Compressibility. 342 

11.8 Functional Independence. 343 

11.9 Summary. 346 

11.10 Exercises. 346 

12 Information Capacity Assessment . 351 

12.1 Entropy and Properties. 351 

12.1.1 Entropy of a random variable. 352 

12.1.2 Entropy under a change of coordinates. 353 









































12.2 Entropy Flow. 356 

12.3 The Input Layer Entropy. 358 

12.4 The Linear Neuron. 360 

12.5 The Autoencoder. 361 

12.6 Conditional Entropy . 363 

12.7 The Mutual Information. 364 

12.8 Applications to Deep Neural Networks. 372 

12.8.1 Entropy Flow. 372 

12.8.2 Noisy layers. 374 

12.8.3 Independent layers. 375 

12.8.4 Compressionless layers. 376 

12.8.5 The number of features. 377 

12.8.6 Total Compression. 380 

12.9 Network Capacity. 381 

12.9.1 Types of capacity. 382 

12.9.2 The input distribution. 383 

12.9.3 The output distribution. 383 

12.9.4 The input-output tensor. 384 

12.9.5 The input-output matrix. 384 

12.9.6 The existence of network capacity. 385 

12.9.7 The Lagrange multiplier method. 387 

12.9.8 Finding the Capacity. 390 

12.9.9 Perceptron Capacity. 390 

12.10 The Information Bottleneck. 393 

12.10.1 An exact formal solution . 395 

12.10.2 The information plane. 398 

12.11 Information Processing with MNIST . 400 

12.11.1 A two-layer FNN. 402 

12.11.2 A three-layer FNN. 405 

12.11.3 The role of convolutional nets. 406 

12.12 Summary. 409 

12.13 Exercises. 409 

Part IV Geometric Theory 

13 Output Manifolds . 417 

13.1 Introduction to Manifolds. 417 

13.1.1 Intrinsic and Extrinsic. 422 

13.1.2 Tangent space. 422 

13.1.3 Riemannian metric. 423 

13.1.4 Geodesics. 424 

13.1.5 Levi-Civita Connection. 425 

13.1.6 Submanifolds. 426 











































13.1.7 Second Fundamental Form. 427 

13.1.8 Mean Curvature Vector Field. 429 

13.2 Relation to Neural Networks. 430 

13.3 The Parameters Space. 431 

13.4 Optimal Parameter Values. 436 

13.5 The Metric Structure. 439 

13.6 Regularization. 443 

13.6.1 Going for a smaller dimension. 444 

13.6.2 Norm regularization. 444 

13.6.3 Choosing the flattest manifold. 445 

13.6.4 Model averaging. 453 

13.6.5 Dropout. 454 

13.7 Summary. 462 

13.8 Exercises. 462 

14 Neuromanifolds. 465 

14.1 Statistical Manifolds. 466 

14.2 Fisher Information. 469 

14.3 Neuromanifold of a Neural Net. 473 

14.4 Fisher Metric for One Neuron. 475 

14.5 The Fisher Matrix and Its Inverse. 478 

14.6 The Fisher Metric Structure of a Neural Net. 483 

14.7 The Natural Gradient. 487 

14.8 The Natural Gradient Learning Algorithm. 489 

14.9 Log-likelihood and the Metric. 493 

14.10 Relation to the Kullback-Leibler Divergence. 495 

14.11 Simulated Annealing Method. 499 

14.12 Summary. 500 

14.13 Exercises. 501 

Part V Other Architectures 

15 Pooling. 507 

15.1 Approximation of Continuous Functions. 507 

15.2 Translation Invariance. 510 

15.3 Information Approach. 512 

15.4 Pooling and Classification . 514 

15.5 Summary. 515 

15.6 Exercises. 515 

16 Convolutional Networks. 517 

16.1 Discrete One-dimensional Signals. 517 

16.2 Continuous One-dimensional Signals. 519 










































16.3 Discrete Two-dimensional Signals. 520 

16.4 Convolutional Layers with 1-D Input. 522 

16.5 Convolutional Layers with 2-D Input. 525 

16.6 Geometry of CNNs. 528 

16.7 Equivariance and Invariance. 529 

16.7.1 Groups. 529 

16.7.2 Actions of groups on sets. 530 

16.7.3 Extension of actions to functions. 532 

16.7.4 Definition of equivariance. 533 

16.7.5 Convolution and equivariance. 534 

16.7.6 Definition of invariance. 537 

16.8 Summary. 539 

16.9 Exercises. 540 

17 Recurrent Neural Networks. 543 

17.1 States Systems. 543 

17.2 RNNs . 546 

17.3 Information in RNNs. 547 

17.4 Loss Functions. 549 

17.5 Backpropagation Through Time. 550 

17.6 The Gradient Problems. 553 

17.7 LSTM Cells. 554 

17.8 Deep RNNs. 556 

17.9 Summary. 557 

17.10 Exercises. 557 

18 Classification. 561 

18.1 Equivalence Relations. 561 

18.2 Entropy of a Partition. 563 

18.3 Decision Functions. 565 

18.4 One-hot-vector Decision Maps. 567 

18.5 Linear Separability. 570 

18.6 Convex Separability. 575 

18.7 Contraction Toward a Center. 577 

18.8 Learning Decision Maps. 578 

18.8.1 Linear Decision Maps. 578 

18.8.2 Nonlinear Decision Maps. 585 

18.9 Decision Boundaries. 587 

18.10 Summary. 589 

18.11 Exercises. 589 










































19 Generative Models. 591 

19.1 The Need of Generative Models. 591 

19.2 Density Estimation. 592 

19.3 Adversarial Games. 595 

19.4 Generative Adversarial Networks. 598 

19.5 Generative Moment Matching Networks. 606 

19.6 Summary. 607 

19.7 Exercises. 607 

20 Stochastic Networks. 611 

20.1 Stochastic Neurons. 611 

20.2 Boltzmann Distribution. 614 

20.3 Boltzmann Machines. 617 

20.4 Boltzmann Learning. 621 

20.5 Computing the Boltzmann Distribution. 623 

20.6 Entropy of Boltzmann Distribution. 626 

20.7 Fisher Information. 626 

20.8 Applications of Boltzmann Machines. 628 

20.9 Summary. 633 

20.10 Exercises. 633 

Hints and Solutions. 637 

Appendix. 689 

A Set Theory. 691 

B Tensors. 693 

C Measure Theory. 695 

C.1 Information and $\sigma$-algebras. 695

C.2 Measurable Functions. 697 

C.3 Measures. 699 

C.4 Integration in Measure. 701 

C.5 Image Measures. 703 

C.6 Indefinite Integrals. 703 

C.7 Radon-Nikodym Theorem. 703 

C.8 Egorov and Luzin’s Theorems. 704 

C.9 Signed Measures. 705 





































D Probability Theory. 709 

D.1 General definitions. 709

D.2 Examples. 710 

D.3 Expectation. 710 

D.4 Variance. 711 

D.5 Information generated by random variables. 712 

D.5.1 Filtrations. 713 

D.5.2 Conditional expectations. 714 

D.6 Types of Convergence. 715 

D.6.1 Convergence in probability. 715 

D.6.2 Almost sure convergence. 715 

D.6.3 $L^p$-convergence. 715

D.6.4 Weak convergence. 716

D.7 Log-Likelihood Function. 717

D.8 Brownian Motion. 719

E Functional Analysis. 721 

E.1 Banach spaces. 721

E.2 Linear Operators. 722

E.3 Hahn-Banach Theorem. 723

E.4 Hilbert Spaces. 723

E.5 Representation Theorems. 724

E.6 Fixed Point Theorem. 727

F Real Analysis. 729

F.1 Inverse Function Theorem. 729

F.2 Differentiation in generalized sense. 730

F.3 Convergence of sequences of functions. 731

G Linear Algebra. 733

G.1 Eigenvalues, Norm, and Inverse Matrix. 733

G.2 Moore-Penrose Pseudoinverse. 736 

Bibliography. 741 

Index. 751 

































Part I 

Introduction to Neural Networks 





Chapter 1 

Introductory Problems 


This chapter introduces a few daily life problems which lead to the concept 
of abstract neuron and neural network. They are all based on the process of 
adjusting a parameter, such as a rate, a flow, or a current that feeds a given 
unit (tank, transistor, etc.), which triggers a certain activation function. The 
adjustable parameters are optimized to minimize a certain error function. At 
the end of the section we shall provide some conclusions, which will pave the 
path to the definition of the abstract neuron and neural networks. 


1.1 Water in a Sink 

A pipe is supplied with water at a given pressure $P$, see Fig. 1.1 a. The knob 
$K$ can adjust the water pressure, providing a variable outgoing water flow at 
rate $r$. The knob influences the rate by introducing a nonnegative control $w$ 
such that $P = r/w$. 

A certain number of pipes of this type are used to pour water into a tank, 
see Fig. 1.1 b. Simultaneously, the tank is drained at a rate R. The goal of 
this problem is stated in the following: 

Given the water pressure supplies $P_1, \dots, P_n$, how can one adjust the knobs 
$K_1, \dots, K_n$ such that after an a priori fixed interval of time $t$ there is a 
predetermined volume $V$ of water in the tank? 

Denote by $r_1, \dots, r_n$ the inflow rates adjusted by the knobs $K_1, \dots, K_n$. 
If the outflow rate exceeds the total inflow rate, i.e., if $R > \sum_{i=1}^n r_i$, then no 
water will accumulate in the tank, i.e., $V = 0$ at any future time. Otherwise, 
if the total inflow rate is larger than the outflow rate, i.e., if $\sum_{i=1}^n r_i > R$, 
then the water accumulates at the rate given by the difference of the rates, 
$\sum_{i=1}^n r_i - R$, accumulating over time $t$ an amount of water given by 
$V = \left(\sum_{i=1}^n r_i - R\right) t$. The resulting water amount can be written as a 
piecewise function 

$$V = \begin{cases} 0, & \text{if } R > \sum_{i=1}^n r_i \\ \left(\sum_{i=1}^n r_i - R\right) t, & \text{otherwise.} \end{cases}$$

Figure 1.1: a. The knob $K$ adjusts the pressure $P$ through a pipe to an outflow 
rate $r$. b. The $n$ knobs $K_1, \dots, K_n$ adjust pressures through pipes that pour 
water into a tank with a sink hole having an outflow rate $R$. 

© Springer Nature Switzerland AG 2020 
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_1 


We shall write this function in an equivalent way. First, let $w_i$ denote the 
control provided by the knob $K_i$, so the $i$th pipe supplies water at a rate 
$r_i = P_i w_i$. Consider also the activation function¹ 

$$\varphi_t(x) = \begin{cases} 0, & \text{if } x < 0 \\ xt, & \text{otherwise,} \end{cases}$$

which depends on the time parameter $t > 0$. Now, the volume of water, $V$, 
can be written in terms of the pressures $P_i$, controls $w_i$, and function $\varphi_t$ as 

$$V = \varphi_t\Big(\sum_{i=1}^n P_i w_i - R\Big). \tag{1.1.1}$$

¹ As we shall see in a future chapter, this type of activation function is known under the 
name of ReLU. 





























The rate $R$ produces a horizontal shift for the graph of $\varphi_t$ and in the neural 
networks language it is called bias. The problem reduces now to finding a solution 
of equation (1.1.1), having the unknowns $w_i$ and $R$. In practice it would 
suffice to obtain an approximation of the solution by choosing the controls 
$w_i$ and the bias $R$ such that the following proximity function 

$$L(w_1, \dots, w_n, R) = \frac{1}{2}\Big(\varphi_t\Big(\sum_{i=1}^n P_i w_i - R\Big) - V\Big)^2$$

is minimized. Other proximity functions can be also considered, for instance, 

$$\tilde{L}(w_1, \dots, w_n, R) = \Big|\varphi_t\Big(\sum_{i=1}^n P_i w_i - R\Big) - V\Big|,$$

the inconvenience being that it is not smooth. Both functions $L$ and $\tilde{L}$ are 
nonnegative and vanish along the solution. They are also sometimes called 
error functions. 

In practice we would like to learn from available data. To this purpose, 
we are provided with $N$ training data consisting of $n$-tuples of pressures and 
corresponding volumes, $(P^k, V_k)$, $1 \le k \le N$, with $P^{kT} = (P_1^k, \dots, P_n^k)$. 
Using these data, we should be able to forecast the volume $V$ for any other 
given pressure vector $P$. The forecast is given by formula (1.1.1), provided the 
controls and bias are known. These are obtained by minimizing the following 
cost function: 

$$C(w_1, \dots, w_n, R) = \frac{1}{2} \sum_{k=1}^{N} \Big(\varphi_t\big(P^{kT} w - R\big) - V_k\Big)^2,$$

where $w^T = (w_1, \dots, w_n)$. 
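The forecast (1.1.1) and the cost function above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `phi_t`, `volume`, and `cost`, the synthetic pressures, and the "true" controls used to generate the data are all hypothetical and do not come from the text.

```python
import numpy as np

def phi_t(x, t):
    """Activation of Section 1.1: 0 for x < 0 and x*t otherwise."""
    return t * np.maximum(x, 0.0)

def volume(P, w, R, t):
    """Forecast V = phi_t(sum_i P_i w_i - R), formula (1.1.1)."""
    return phi_t(np.dot(P, w) - R, t)

def cost(P_data, V_data, w, R, t):
    """Cost 1/2 * sum_k (phi_t(P^k . w - R) - V_k)^2 over the training set."""
    residuals = phi_t(P_data @ w - R, t) - V_data
    return 0.5 * np.sum(residuals ** 2)

# Hypothetical training set generated from assumed "true" controls and bias.
rng = np.random.default_rng(0)
w_true, R_true, t = np.array([0.5, 1.0]), 1.0, 2.0
P_data = rng.uniform(1.0, 3.0, size=(20, 2))      # N = 20 pressure vectors
V_data = phi_t(P_data @ w_true - R_true, t)       # corresponding volumes

print(cost(P_data, V_data, w_true, R_true, t))        # vanishes at the solution
print(cost(P_data, V_data, w_true + 0.1, R_true, t))  # positive away from it
```

The cost vanishes at the parameters that generated the data and is positive elsewhere, which is the property a minimization algorithm would exploit.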



1.2 An Electronic Circuit 

The next example deals with an electronic circuit. It is worthwhile to mention 
that a problem regarding pressure, $P$, rates, $r$, and knobs can be “translated” 
into electronics language, using voltage, $V$, current intensity, $I$, and resistors, 
$R$, respectively. The following correspondence table will be useful: 

Fluids        pressure   rate        knob
Electronics   voltage    intensity   resistivity


Consider the electronic circuit provided in Fig. 1.2. Voltages $x_1, \dots, x_n$ 
applied at the input lines are transmitted through the variable resistors 
$R_1, \dots, R_n$. The emerging currents are given by Ohm's law 

$$I_k = \frac{x_k}{R_k}, \qquad k = 1, \dots, n.$$

For simplicity reasons, define the weights $w_k = \frac{1}{R_k}$ as the inverse of 
resistances. Then the previous relation becomes $I_k = x_k w_k$. 

Figure 1.2: The neuron implemented as an electronic circuit. 

The $n$ currents converge at the node $N$. Another variable resistor, $R_b$, 
connects the node $N$ with the base. The current through this wire is denoted 
by $I_b$. By Kirchhoff's current law, the sum of the incoming currents at the node 
$N$ is equal to the sum of the outgoing currents at the same node. Therefore, 
the current $I$ that enters the transistor $T$ is given by 

$$I = \sum_{k=1}^n I_k - I_b = \sum_{k=1}^n x_k w_k - I_b.$$


The transistor is an electronic component that acts as a switch. More precisely, 
if the ingoing current $I$ is lower than the transistor's threshold $\theta$, then 
no current gets through. Once the current $I$ exceeds the threshold $\theta$, the 
current gets through, eventually magnified by a certain factor. The output 
current $y$ of the transistor $T$ can be modeled as 

$$y = \begin{cases} 0, & \text{if } I < \theta \\ k, & \text{if } I \ge \theta, \end{cases}$$

for some constant $k > 0$. In this case the activation function is the step 
function 

$$\varphi_\theta(x) = \begin{cases} 0, & \text{if } x < \theta \\ 1, & \text{if } x \ge \theta \end{cases} \tag{1.2.2}$$




























Figure 1.3: The unit step function $\varphi_\theta(x)$ with $\theta = 1$. 


and the output can be written in the familiar form 

$$y = k\varphi_\theta(I) = k\varphi_0(I - \theta) = k\varphi_0\Big(\sum_{k=1}^n I_k - I_b - \theta\Big) = k\varphi_0\Big(\sum_{k=1}^n x_k w_k - \beta\Big),$$

where $\beta = I_b + \theta$ is considered as a bias. Note that $\varphi_0(x)$ is the Heaviside 
function, or the unit step function, centered at the origin. For a Heaviside 
function centered at $\theta = 1$, see Fig. 1.3. 

The learning problem can be formulated as: Given a current $z = z(x_1, \dots, x_n)$ 
depending on the input voltages $x_1, \dots, x_n$, adjust the resistances $R_1, \dots, R_n$, 
such that the output $y$ approximates $z$ in the least squares way, i.e., such 
that $\frac{1}{2}(z - y)^2$ is minimum. 
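The circuit model of this section can be traced numerically with a minimal Python sketch. The function names `step`, `transistor_output`, and `squared_error`, as well as the sample voltages, resistances, and threshold, are hypothetical choices made only for illustration.

```python
def step(x, theta=0.0):
    """Unit step function centered at theta, as in (1.2.2)."""
    return 1.0 if x >= theta else 0.0

def transistor_output(x, R, I_b, theta, k):
    """Output current y = k * phi_theta(I), with I = sum_k x_k / R_k - I_b."""
    currents = [x_k / R_k for x_k, R_k in zip(x, R)]  # Ohm's law, I_k = x_k w_k
    I = sum(currents) - I_b                           # current entering the transistor
    return k * step(I, theta)

def squared_error(z, y):
    """Least squares proximity 1/2 * (z - y)^2 between target z and output y."""
    return 0.5 * (z - y) ** 2

# Hypothetical voltages, resistances, base current, threshold, and gain.
y = transistor_output(x=[1.0, 2.0], R=[1.0, 2.0], I_b=0.5, theta=1.0, k=3.0)
print(y)                      # I = 1.5 reaches the threshold 1.0, so y = k
print(squared_error(3.0, y))  # zero error when the target current equals k
```

Raising the threshold above the incoming current switches the output off, which is the switch-like behavior described above.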

Remark 1.2.1 Attempts to build special hardware for artificial neural net- 
works go back to the 1950s. We refer the reader to the work of Minsky 
[86], who implemented adaptive weights using potentiometers, Steinbuch, 
who built associative memories using resistance networks [117], as well as 
Widrow and Hoff [125], who developed an adaptive system specialized for 
signal processing, called the Adaline neuron. 


1.3 The Eight Rooks Problem 

A rook is a chess piece which is allowed to move both vertically and horizontally 
any number of chessboard squares. The problem requires placing eight 
distinct rooks in distinct squares of an 8 x 8 chessboard. Assuming that all 
possible placements are equally likely, we need to find the probability that 
all the rooks are safe from one another. Even if this problem can be solved 
in closed form by combinatorial techniques, we shall present in the following 
a computational approach (Fig. 1.4). 






Figure 1.4: A solution for the rooks problem on a 4 x 4 chess board. 


Let $x_{ij}$ be the state of the $(i, j)$th square, which is 

$$x_{ij} = \begin{cases} 1, & \text{if there is a rook at position } (i, j); \\ 0, & \text{if there is no rook at that place.} \end{cases}$$

These states are used in the construction of the objective function as in the 
following. Since there is only one rook placed in the $j$th column, we have 
$\sum_{i=1}^8 x_{ij} = 1$. The fact that this property holds for all columns can be stated 
equivalently in a variational way by requiring the function 

$$F_1(x_{11}, \dots, x_{88}) = \sum_{j=1}^{8} \Big(\sum_{i=1}^{8} x_{ij} - 1\Big)^2$$

to be minimized. The minimum of $F_1$ is reached when all squared expressions 
vanish and hence there is only one rook in each column. Applying a similar 
procedure for the rows, we obtain that the function 

$$F_2(x_{11}, \dots, x_{88}) = \sum_{i=1}^{8} \Big(\sum_{j=1}^{8} x_{ij} - 1\Big)^2$$

must also be minimized. 

The rooks problem has many solutions, each of them corresponding to a 
global minimum of the objective function 

$$E(x_{11}, \dots, x_{88}) = F_1(x_{11}, \dots, x_{88}) + F_2(x_{11}, \dots, x_{88}).$$

The minima of the previous function can be obtained using a neural network, 
called Hopfield network. This application will be accomplished in Chapter 20. 
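As a small check of the objective function, the following Python sketch evaluates $E = F_1 + F_2$ on a board stored as an 8 x 8 array of zeros and ones. The function name `rook_energy` and the two sample boards are illustrative assumptions; the sketch only evaluates the energy and does not implement the Hopfield network of Chapter 20.

```python
import numpy as np

def rook_energy(x):
    """Objective E = F1 + F2 for a 0/1 board x, where x[i, j] = 1 marks a rook."""
    x = np.asarray(x)
    F1 = np.sum((x.sum(axis=0) - 1) ** 2)  # one term per column j (sum over i)
    F2 = np.sum((x.sum(axis=1) - 1) ** 2)  # one term per row i (sum over j)
    return int(F1 + F2)

# Rooks on the main diagonal: one rook in every row and every column.
safe_board = np.eye(8, dtype=int)

# All eight rooks in the first row: every rook is attacked.
unsafe_board = np.zeros((8, 8), dtype=int)
unsafe_board[0, :] = 1

print(rook_energy(safe_board))    # 0, a global minimum
print(rook_energy(unsafe_board))  # 56 = 7^2 + 7 * 1 from the row terms
```

Any permutation matrix gives energy zero, which reflects the remark that the problem has many solutions.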
































Figure 1.5: A neuron cell. 


1.4 Biological Neuron 

A neuron is a cell which consists of the following parts: dendrites, axon, and 
body-cell, see Fig. 1.5. The synapse is the connection between the axon of 
one neuron and the dendrite of another. The function of each part is briefly 
described below: 

• Dendrites are transmission channels that collect information from the 
axons of other neurons. This is described in Fig. 1.6 a. The signal traveling 
through an axon reaches its terminal end and produces some chemicals $x_i$ 
which are liberated in the synaptic gap. These chemicals act on the 
dendrites of the next neuron either in a strong or a weak way. The connection 
strength is described by the weight system $w_i$, see Fig. 1.6 b. 

• The body-cell collects all signals from dendrites. Here the dendrites' 
activity adds up into a total potential and if a certain threshold is reached, 
the neuron fires a signal through the axon. The threshold depends on the 
sensitivity of the neuron and measures how easy it is to get the neuron to fire. 

• The axon is the channel for signal propagation. The signal consists of 
the movement of ions from the body-cell towards the end of the axon. The 
signal is transmitted electrochemically to the dendrites of the next neuron, 
see Fig. 1.5. 

A schematic model of the neuron is provided in Fig. 1.6 b. The input 
information coming from the axons of other neurons is denoted by $x_i$. These 
get multiplied by the weights $w_i$ modeling the synaptic strength. Each dendrite 
supplies the body-cell by a potential $x_i w_i$. The body-cell is the unit 
which collects and adds these potentials into $\sum_{i=1}^n x_i w_i$ and then compares 
it with the threshold $b$. The outcome is as follows: 











Figure 1.6: a. Synaptic gap. b. Schematic model of a neuron. 

• if $\sum_{i=1}^n x_i w_i \ge b$, the neuron fires a signal through the axon, $y = 1$. 

• if $\sum_{i=1}^n x_i w_i < b$, the neuron does not fire any signal, i.e., $y = 0$. 

This can be written equivalently as 

$$y = \varphi_b\Big(\sum_{i=1}^n x_i w_i\Big) = \varphi_0\Big(\sum_{i=1}^n x_i w_i - b\Big),$$

where the activation function $\varphi_b$ is the unit step function centered at $b$, see 
equation (1.2.2), and $\varphi_0$ is the Heaviside function. 

Bias elimination In order to get rid of the bias, a smart trick is used. The 
idea is to introduce an extra weight, $w_0 = b$, and an extra constant signal, 
$x_0 = -1$. Then the previous outcome can be written as 

$$y = \varphi_0\Big(\sum_{i=0}^n x_i w_i\Big).$$

The incoming signals $x_i$ and the synaptic weights $w_i$ can be seen in Fig. 1.6 
b. The neuron body-cell is represented by the main circle, while its functions 
are depicted by the two symbols, $\Sigma$ (to suggest the summation of incoming 
signals) and the unit step function (to emphasize the activation function) of 
the neuron. The outcome signal of the axon is represented by an arrow with 
the outcome $y$, which is either 0 or 1. 
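The bias elimination trick can be checked with a short Python sketch that compares the neuron written with an explicit bias against the version with the extra weight $w_0 = b$ and constant signal $x_0 = -1$. The function names and the sample weights and inputs are hypothetical and serve only as an illustration.

```python
import numpy as np

def heaviside(u):
    """Heaviside function phi_0: 1 for u >= 0 and 0 otherwise."""
    return 1 if u >= 0 else 0

def neuron_with_bias(x, w, b):
    """y = phi_0(sum_{i=1}^n x_i w_i - b)."""
    return heaviside(np.dot(x, w) - b)

def neuron_bias_free(x, w, b):
    """Same neuron after the trick: prepend x_0 = -1 and w_0 = b."""
    x_ext = np.concatenate(([-1.0], x))
    w_ext = np.concatenate(([b], w))
    return heaviside(np.dot(x_ext, w_ext))   # y = phi_0(sum_{i=0}^n x_i w_i)

# Hypothetical weights, bias, and inputs.
w, b = np.array([2.0, -1.0]), 0.5
for x in (np.array([1.0, 1.0]), np.array([0.0, 1.0])):
    print(neuron_with_bias(x, w, b), neuron_bias_free(x, w, b))
```

Both formulations agree on each input, so the bias can be treated as just another weight during learning.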

The neuron learns an output function by updating the synaptic weights 
$w_i$. A single neuron cannot learn complicated functions, but it can learn a 
simple function like, for instance, the piecewise function 

$$z(x_1, x_2) = \begin{cases} 0, & \text{if } x_2 < 0 \\ 1, & \text{if } x_2 \ge 0, \end{cases}$$

which takes the value 1 on the upper half-plane and 0 otherwise. This function 
can be learned by choosing the weights $w_1 = 0$, $w_2 = 1$, and bias $b = 0$. 
The neuron fires when the inequality $x_1 w_1 + x_2 w_2 \ge b$ is satisfied. This is 
equivalent to $x_2 \ge 0$, which is the equation of the upper half-plane. Hence, 
one neuron can learn the upper half-plane. By this we mean that the neuron 
can distinguish whether a point $(x_1, x_2)$ belongs to the upper half-plane. 
Similarly, one can show that a neuron can learn any half-plane. 

However, the neuron cannot learn a more complicated function, such as 

$$\zeta(x_1, x_2) = \begin{cases} 0, & \text{if } x_1^2 + x_2^2 \ge 1 \\ 1, & \text{if } x_1^2 + x_2^2 < 1, \end{cases}$$

i.e., it cannot learn to decide whether a point belongs to the interior of the 
unit circle. 

Figure 1.7: A linear approximator neuron. 


1.5 Linear Regression 

Neural networks are not a brand new concept. They first appeared under 
the notion of linear regression in the work of Gauss and Legendre around the 
1800s. The next couple of sections show that a linear regression model can 
be naturally interpreted as a simple “neural network”. 

The case of a real-valued function Consider a continuous function 
$f : K \to \mathbb{R}$, with $K \subset \mathbb{R}^n$ a compact set. We would like to construct a linear 
function that approximates $f(x)$ in the $L^2$-sense using a neuron with a linear 
activation. The input is the continuous variable $X = (x_1, \dots, x_n) \in K$ and 
the output is the linear function $L(X) = \sum_{j=1}^n a_j x_j + b$, where $a_j$ are the 
weights and $b$ is the neuron bias, see Fig. 1.7. 

For the sake of simplicity we assume the approximation is performed near 
the origin, i.e., $L(0) = f(0)$, a fact that implies the bias value $b = f(0)$. Since 
using a vertical shift we can always assume $f(0) = 0$, it suffices to take 
$b = 0$, so the linear function takes the form $L = \sum_{j=1}^n a_j x_j$. 

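Consider the cost function that measures the $L^2$-distance between the 
neuron output and target function 

$$C(a) = \frac{1}{2} \int_K \big(L(x) - f(x)\big)^2 \, dx_1 \dots dx_n = \frac{1}{2} \int_K \Big(\sum_{j=1}^n a_j x_j - f(x)\Big)^2 \, dx_1 \dots dx_n$$

and compute its gradient 

$$\frac{\partial C}{\partial a_k} = \int_K x_k \Big(\sum_{j=1}^n a_j x_j - f(x)\Big) \, dx_1 \dots dx_n = \sum_{j=1}^n a_j \rho_{jk} - m_k, \qquad 1 \le k \le n,$$

where 

$$\rho_{jk} = \int_K x_j x_k \, dx_1 \dots dx_n, \qquad m_k = \int_K x_k f(x) \, dx_1 \dots dx_n.$$

In the equivalent matrix form this becomes 

$$\nabla_a C = \rho a - m,$$

where $\nabla_a C = \big(\frac{\partial C}{\partial a_1}, \dots, \frac{\partial C}{\partial a_n}\big)^T$, $\rho = \rho^T = (\rho_{jk})$ and $m = (m_1, \dots, m_n)^T$. The 
optimal values of the weights are obtained by solving the Euler equation $\nabla_a C = 0$. 
We obtain $a = \rho^{-1} m$, provided the matrix $\rho$ is invertible, see Exercise 1.9.4. 
This means 

$$a_j = \sum_{k=1}^n \rho^{jk} m_k, \qquad 1 \le j \le n,$$

where $\rho^{-1} = (\rho^{jk})$. 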


Remark 1.5.1 In order to avoid computing the inverse $\rho^{-1}$, we can employ 
the gradient descent method, which will be developed in Chapter 4. Let 
$a_k(0)$ be the weight initialization (for instance, either uniform or normally 
distributed with mean zero and a small standard deviation, as it shall 
be described in section 6.3). Then it can be shown that the sequence of 
approximations of the weight vector $a$ is defined recursively by 

$$a_k(t+1) = a_k(t) - \lambda \frac{\partial C}{\partial a_k} = a_k(t) - \lambda \Big(\sum_{j=1}^n a_j(t) \rho_{jk} - m_k\Big), \qquad 1 \le k \le n,$$

where $\lambda > 0$ is the learning rate. 
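The closed-form weights $a = \rho^{-1} m$ and the gradient descent recursion of Remark 1.5.1 can be compared numerically. The sketch below is only an illustration under assumptions that are not in the text: the domain $K = [-1, 1]^2$, the target $f(x) = \sin x_1 + 0.5\, x_2$, a midpoint-rule quadrature for the integrals, and the learning rate $\lambda = 0.5$.

```python
import numpy as np

def target(X):
    """Hypothetical target with f(0) = 0, so the bias can be taken as b = 0."""
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Midpoint quadrature nodes on the assumed compact set K = [-1, 1]^2.
n_grid = 200
h = 2.0 / n_grid
pts = -1.0 + h * (np.arange(n_grid) + 0.5)
x1, x2 = np.meshgrid(pts, pts, indexing="ij")
X = np.stack([x1.ravel(), x2.ravel()], axis=1)   # one row per quadrature node
cell = h * h                                      # area of each grid cell

rho = (X.T @ X) * cell        # rho_jk ~ integral over K of x_j x_k
m = (X.T @ target(X)) * cell  # m_k   ~ integral over K of x_k f(x)

# Closed form: solve rho a = m instead of forming the inverse explicitly.
a_closed = np.linalg.solve(rho, m)

# Gradient descent: a(t+1) = a(t) - lam * (rho a(t) - m).
lam = 0.5
a_gd = np.zeros(2)
for _ in range(200):
    a_gd = a_gd - lam * (rho @ a_gd - m)

print(a_closed)
print(a_gd)
```

For this choice of $K$ the matrix $\rho$ is close to $\frac{4}{3}$ times the identity, so each iteration shrinks the error by a factor of about $1/3$ and the two weight vectors agree after a few dozen steps. A learning rate that is too large relative to the eigenvalues of $\rho$ would make the recursion diverge.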






Figure 1.8: A linear approximator neural net. 


The case of a multivalued function This section presents a neural network 
which learns the linear approximation of a multivalued function. Consider 
a continuous function $F : \mathbb{R}^n \to \mathbb{R}^m$, which we would like to approximate 
around the origin by a linear function of type $L(X) = AX + b$, where $A$ is 
an $m \times n$ matrix and $b \in \mathbb{R}^m$ is a vector. 

We start by noting that $L(X)$ is the output of a 2-layer neural network 
having linear activations on each output neuron. The inputs are 
$X = (x_1, \dots, x_n)$, the weight matrix is $A = (a_{ij})$, and the bias vector is given by 
$b$. Since the approximation is applied at $X = 0$, it follows that $b = F(0)$, 
which fixes the bias vector. The output of the $k$th neuron is given by the 
real-valued linear function $L_k(X) = \sum_{j=1}^n a_{jk} x_j + b_k$, see Fig. 1.8. 

We assume the input $X \in \mathbb{R}^n$ to be a continuous variable taking values in 
the compact set $K \subset \mathbb{R}^n$. In this case the cost function of the network is 
given by 

$$C(A) = \frac{1}{2} \int_K \|L(X) - F(X)\|^2 \, dx_1 \dots dx_n = \sum_{k=1}^m C_k(A),$$

where 

$$C_k(A) = \frac{1}{2} \int_K \big(L_k(X) - F_k(X)\big)^2 \, dx_1 \dots dx_n,$$

with $F = (F_1, \dots, F_m)$. The minimization of the cost function $C(A)$ is 
equivalent to the simultaneous minimization of all cost functions $C_k(A)$, 
$1 \le k \le m$. Each of these cost functions is relative to a neuron and we have 
shown their optimization procedure in the previous case. 






Figure 1.9: A water tank with a knob and outflow $\phi(x) = \max(wx - b, 0)$. 

The next two examples will introduce the idea of “deep learning” network, 
which is the subject of this book. 


1.6 The Cocktail Factory Network 

Assume there is a water inflow at rate $r$ into a tank, which has a knob $K$ 
that controls the outflow from the tank, see Fig. 1.9. The knob is situated at 
a distance $h$ above the bottom of the tank. After a given time, $t$, the amount 
of water that flowed in the tank is $x = rt$. Denote by $w$ the parameter that 
controls the outflow from the knob. If the level of the water inflow is below 
$h$, no water will flow out. If the level of the water inflow is above $h$, a fraction 
$w$ of the overflow will flow out of the tank. If $A$ is the tank base area, then 
$x - Ah$ represents the overflow and the amount $w(x - Ah) = wx - b$ will 
flow through the knob, where $b = wAh$. Then the tank outflow is modeled 
by the hockey-stick function 

$$\phi(x) = \begin{cases} 0, & \text{if } x \le b/w \\ wx - b, & \text{otherwise.} \end{cases}$$

In the following we shall organize our tanks into a network of six tanks
that mix three ingredients into a cocktail, see Fig. 1.10. Sugar, water, and
wine are poured into tanks A, B, and C. A mixture of the contents of tanks A
and B goes into tank D as sweet water, and a mixture of the contents of tanks B
and C flows into tank E, as diluted wine. Mixing the outflows of tanks D and
E provides the content of container F. The knob of container F controls the
final cocktail production. This represents an example of a feedforward neural
network with two hidden layers, formed by containers (A, B, C) and (D,
E). The input to the network is made through three pipes and the output
through the knob of container F.
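The flow through the factory can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the text: the inflows into a tank are simply added, the outflow of tank B is fed in full to both D and E, and the names `tank_outflow`, `cocktail_factory`, and the `knobs` dictionary are hypothetical.

```python
def tank_outflow(x, w, b):
    """Hockey-stick outflow of one tank: max(w*x - b, 0)."""
    return max(w * x - b, 0.0)


def cocktail_factory(sugar, water, wine, knobs):
    """Forward pass through the six tanks.

    knobs maps a tank name to its pair (w, b); the layout is an assumption
    made for this sketch only.
    """
    # First hidden layer: tanks A, B, C receive the raw ingredients.
    a = tank_outflow(sugar, *knobs["A"])
    b = tank_outflow(water, *knobs["B"])
    c = tank_outflow(wine, *knobs["C"])
    # Second hidden layer: D mixes A and B, E mixes B and C.
    d = tank_outflow(a + b, *knobs["D"])
    e = tank_outflow(b + c, *knobs["E"])
    # Output: container F mixes D and E.
    return tank_outflow(d + e, *knobs["F"])


if __name__ == "__main__":
    fully_open = {name: (1.0, 0.0) for name in "ABCDEF"}
    print(cocktail_factory(1.0, 2.0, 3.0, fully_open))
```

Adjusting the pairs $(w, b)$ in `knobs` plays the role of the expert turning the knobs.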
















Figure 1.10: A cocktail factory as a neural network with two hidden layers.


The final production of cocktail is the result of a complex combination of
the input ingredients, sugar, water, and wine, poured in quantities $x_1$, $x_2$,
and $x_3$. The knobs control the proportion of solution allowed to pass to the
next container, and are used to adjust the quality of the cocktail production.
After tasting the final outcome, a cocktail expert will adjust the knobs such
that the cocktail will improve its quality as much as possible. The process of
adjusting the knobs corresponds to learning how to make a cocktail from the
given ingredients.


1.7 An Electronic Network 

We can stack together electronic devices like the ones described in Fig. 1.2
to construct more complex electronic networks. An example of this type of
network is given in Fig. 1.11. It contains six transistors, three in the first
layer, two in the second, and one in the output layer. There are $n$ current
inputs, $x_1, \dots, x_n$, and $3n + 14$ variable resistors.

Since there is a transistor, $T$, in the last layer, the output $y_{out}$ is Boolean,
i.e., it is 1 if the current through $T$ is larger than the transistor’s threshold,
and 0 otherwise. This type of electronic network can be used to learn to classify
the inputs $x_1, \dots, x_n$ into 2 classes after a proper tuning of the variable
resistors. If the inputs $x_1, \dots, x_n$ are connected to $n$ photocells, the network
can be used for image classification with two classes. A similar idea using









































































Figure 1.11: Neural network with two hidden layers and 6 transistors. Each 
transistor implements a logic gate. 


custom-built hardware for image recognition was used in the late 50s by 
Rosenblatt in the construction of his perceptron machine, see [102]. 

The last two examples provided a brief idea about what a “deep neural 
network” is. Next we shall bring up some of the “questions” regarding deep 
neural networks, which will be addressed mathematically later in this book. 

(i) What classes of functions can be represented by a deep neural network?
It will be shown in Part II that this class is quite rich, provided the network has
at least one hidden layer. The class contains, for instance, any continuous
function, or integrable function, provided the input data belongs to a compact
set. This is the reason why neural networks deserve the name of “universal
function approximators”.

(ii) How are the optimal parameters of a deep neural network estimated?
Tuning the network parameters (the knobs or the variable resistors in our
previous examples) in order to approximate the desired function corresponds
to a learning process. This is accomplished by minimizing a certain distance
or error function, which measures the proximity between the desired function
and the network output. One of the most widely used learning methods is the gradient


































































































descent method, which changes parameters in the direction where the error 
function decreases the most. Other learning methods will be studied in Part I, 
Chapter 4. 
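As a rough illustration of the gradient descent idea, the following sketch tunes the two parameters of a single linear neuron $y = wx - b$ so that it fits a few observations. The data, the learning rate `lr`, the number of steps, and the function name `gradient_descent` are all assumptions chosen for this example; the error function is the mean of half squared differences.

```python
def gradient_descent(xs, ys, lr=0.1, steps=2000):
    """Fit y = w*x - b by stepping against the gradient of the error.

    Error: C(w, b) = (1/(2n)) * sum_i (w*x_i - b - y_i)**2
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x - b - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = -sum(residuals) / n
        # Move in the direction where the error decreases the most.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    ys = [2.0 * x - 1.0 for x in xs]  # data generated with w = 2, b = 1
    print(gradient_descent(xs, ys))
```

For this small quadratic error the iteration converges to the generating parameters; for deep networks the same update rule is applied to many more parameters and convergence is not guaranteed in general.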

(iii) How can we measure the amount of information which is lost or compressed
in a deep learning network?

Neural networks are also information processors. The information contained
in the input is compressed by the hidden layers until the output layer produces
the desired information. For instance, if a deep network is used to
classify whether an animal is a cat or a dog, the network has to process the
input information until it extracts the desired information, which corresponds
to 1 bit (cat or dog). This type of study is done in Part III.

1.8 Summary 

We shall discuss here the common features of the previous examples. They
will lead to new abstract concepts, the perceptron and the neural network.

All examples start with some incoming information provided by the input
variables $x_1, \dots, x_n$. In the previous examples they have been considered
as deterministic variables, i.e., one can measure clearly their values,
and these do not change from one observation to another. The analysis also
makes sense in the case when the input variables $x_i$ are random, i.e., they
have different values for distinct observations, and these values are distributed
according to a certain density law. Even more generally, the input information
can be supplied by a stochastic process $x_i(t)$, i.e., a sequence of $n$-tuples of
random variables parametrized over time $t$.

Another common feature is that the inputs $x_i$ are multiplied by a weight
$w_i$ and then the products are added up to form the sum $\sum_{i=1}^{n} x_i w_i$, which is
the inner product between the input vector $x$ and weight vector $w$. We note
that in most examples a bias $b$ is subtracted from the aforementioned inner
product. If a new weight $w_0 = b$ and a new input $x_0 = -1$ are introduced,
this leads to $\sum_{i=1}^{n} x_i w_i - b = \sum_{i=0}^{n} x_i w_i$, which is still the inner product of
two vectors that extend the previous input and weight vectors.
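The bias-absorption identity can be checked directly. The sketch below, with hypothetical function names `neuron_input` and `neuron_input_extended`, computes the same quantity in both ways.

```python
def neuron_input(x, w, b):
    """Inner product of inputs and weights, minus the bias b."""
    return sum(xi * wi for xi, wi in zip(x, w)) - b


def neuron_input_extended(x, w, b):
    """Same value, with the bias absorbed as w_0 = b and x_0 = -1."""
    x_ext = [-1.0] + list(x)
    w_ext = [b] + list(w)
    return sum(xi * wi for xi, wi in zip(x_ext, w_ext))


if __name__ == "__main__":
    x, w, b = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.25
    print(neuron_input(x, w, b), neuron_input_extended(x, w, b))
```

Treating the bias as one more weight is convenient because a learning rule then updates a single vector instead of handling $b$ separately.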

Another common ingredient is the activation function, denoted by $\varphi(x)$.
We have seen in the previous sections several examples of activation functions:
linear, unit step functions, and piecewise-linear; each of them describes a specific
mode of action of the neuronal unit and suits a certain type of problem.

The output function $y$ is obtained by applying the activation function $\varphi$ to
the previous inner product, leading to the expression $y = \varphi\left(\sum_{i=0}^{n} x_i w_i\right)$. If
the input variables $x_i$ are deterministic, then the output $y$ is also deterministic.




However, if the inputs are random variables, the output is also a random 
variable. 

The goal of all the presented examples is to adjust the system of weights
$\{w_i\}_{i=0}^{n}$ such that the outcome $y$ approximates a certain given function as
well as possible. Even if this can sometimes be an exact match, in most cases
it is just an approximation. This is achieved by choosing the weights such
that a certain proximity function is minimized. The proximity function is a
distance-like function between the realized output and the desired output.
For the sake of simplicity we have considered only Euclidean distances, but
there are also other possibilities.

Most examples considered before contain the following ingredients: inputs
$\{x_i\}$, weights $\{w_i\}$, a bias $b$, an activation function $\varphi(x)$, and an error function.
These parts are used to construct a function approximator, by the process
of tuning weights, called learning from observational data. An abstract
concept, which enjoys all the aforementioned properties and contains all these
parts, will be introduced and studied from the abstract point of view in
Chapter 5.

A few examples use a collection of neurons structured in layers. The
neurons in the input layer get activated by the sensors that perceive the
environment. The other neurons get activated through the weighted connections
from the previous neuron activations. As in the case of a single neuron,
we would like to minimize the proximity between the network output and
a desired result. Neural networks will be introduced and studied from their
learning perspective in Chapter 6.


1.9 Exercises 

Exercise 1.9.1 A factory has $n$ suppliers that produce quantities $x_1, \dots, x_n$
per day. The factory is connected with suppliers by a system of roads, which
can be used at variable capacities $c_1, \dots, c_n$, so that the factory is supplied
daily the amount $x = c_1 x_1 + \cdots + c_n x_n$.

(a) Given that the factory production process starts when the supply reaches
the critical daily level $b$, write a formula for the daily factory revenue, $y$.

(b) Formulate the problem as a learning problem.

Exercise 1.9.2 A number $n$ of financial institutions, each having a wealth
$x_i$, deposit amounts of money in a fund, at some adjustable rates of deposit
$w_i$, so the money in the fund is given by $x = x_1 w_1 + \cdots + x_n w_n$. The fund
is set up to function as follows: as long as the fund has less than a
certain reserve fund $M$, the fund manager does not invest. Only the money




exceeding the reserve fund $M$ is invested. Let $k = e^{rt}$, where $r$ and $t$ denote
the investment rate of return and time of investment, respectively.

(a) Find the formula for the investment.

(b) State the associated learning problem.

Exercise 1.9.3 (a) Given a continuous function $f : [0,1] \to \mathbb{R}$, find a linear
function $L(x) = ax + b$ with $L(0) = f(0)$ and such that the quadratic error
function $\frac{1}{2}\int_0^1 (L(x) - f(x))^2 \, dx$ is minimized.

(b) Given a continuous function $f : [0,1] \times [0,1] \to \mathbb{R}$, find a linear function
$L(x,y) = ax + by + c$ with $L(0,0) = f(0,0)$ and such that the error
$\frac{1}{2}\int_{[0,1]^2} (L(x,y) - f(x,y))^2 \, dx\,dy$ is minimized.

Exercise 1.9.4 For any compact set $K \subset \mathbb{R}^n$ we associate the symmetric
matrix

$$\rho_{ij} = \int_K x_i x_j \, dx_1 \dots dx_n.$$

The invertibility of the matrix $(\rho_{ij})$ depends both on the shape of $K$ and
dimension $n$.

(a) Show that if $n = 2$ then $\det \rho_{ij} \neq 0$ for any compact set $K \subset \mathbb{R}^2$.

(b) Assume $K = [0,1]^n$. Show that $\det \rho_{ij} \neq 0$, for any $n \geq 1$.





Chapter 2 

Activation Functions 


In order to learn a nonlinear target function, a neural network uses activation
functions which are nonlinear. The choice of each specific activation
function defines different types of neural networks. Even if some of them
have already been introduced in Chapter 1, we shall consider here a more
systematic presentation. Among the zoo of activation functions that can be
found in the literature, four main types are used more often: linear, step functions,
sigmoid, and hockey-stick-type functions. Each of them has its own
advantages (they are differentiable, bounded, etc.) and disadvantages (they
are discontinuous, unbounded, etc.) which will arise in future computations.


2.1 Examples of Activation Functions 


Linear functions The slope of a given line can be used to model the firing
rate of a neuron. A linear activation function is mostly used in a multilayer
network for the output layer. We note that computing with a neuron having
a linear activation function is equivalent to a linear regression. We have

• Linear Function This is the simplest activation function. It just multiplies
the argument by a positive constant, i.e., $f(x) = kx$, $k > 0$ constant,
see Fig. 2.1 a. In this case the firing rate is constant, $f'(x) = k$.

• Identity Function For $k = 1$ the previous example becomes the identity
function $f(x) = x$, which was used in the linear neuron, see Fig. 2.1 b.


Step functions This biologically inspired type of activation function exhibits
an upward jump that simplistically models a neuron activation. Two of them
are described in the following.


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_2 






Figure 2.1: Linear activation functions: a. f(x) = kx b. The identity function 
f(x) = X. 



Figure 2.2: Step activation functions: a. Heaviside function b. Signum function.


• Threshold step function This function is also known under the equivalent
names of binary step, unit step function, or Heaviside function, see Fig. 2.2 a.
It is given by

$$H(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x \geq 0. \end{cases}$$

This activation function will be used later to describe a neuron that fires only
for values $x \geq 0$. It is worth noting that the function is not differentiable at
$x = 0$. However, in terms of generalized functions its derivative is given by
Dirac’s function, $H'(x) = \delta(x)$, where $\delta(x)$ is the probability distribution
measure for a unit mass centered at the origin. Roughly speaking, this is

$$\delta(x) = \begin{cases} 0, & \text{if } x \neq 0 \\ +\infty, & \text{if } x = 0, \end{cases}$$

having unit mass $\int_{\mathbb{R}} \delta(x)\, dx = 1$.

• Bipolar Step Function This function is also known under the name of 
signum function, see Fig. 2.2 b, and is defined as in the following: 


Footnote 1: The identity $H'(x) = \delta(x)$ follows Schwartz’s distribution theory.




















Figure 2.3: Hockey-stick functions: a. $y = \mathrm{ReLU}(x)$ b. $y = \mathrm{PReLU}(\alpha, x)$.

$$S(x) = \begin{cases} -1, & \text{if } x < 0 \\ 1, & \text{if } x \geq 0. \end{cases}$$

It is not hard to see that $S(x)$ relates to the Heaviside function by the linear
relation

$$S(x) = 2H(x) - 1.$$

Consequently, its derivative equals twice Dirac’s function,

$$S'(x) = 2H'(x) = 2\delta(x).$$
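The linear relation between the two step functions is easy to verify numerically. The sketch below uses the convention $H(0) = 1$ and $S(0) = 1$; the function names are chosen for this illustration only.

```python
def heaviside(x):
    """Threshold step function: 0 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else 0.0


def signum(x):
    """Bipolar step function: -1 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else -1.0


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        # The two printed values in each row coincide: S(x) = 2 H(x) - 1.
        print(x, signum(x), 2.0 * heaviside(x) - 1.0)
```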


Hockey-stick functions This class of functions has graphs that resemble a
hockey stick or an increasing L-shaped curve. A few of them are discussed in
the following.

• Rectified Linear Unit (ReLU) In this case the activation is linear only
for $x > 0$, see Fig. 2.3 a, and it is given by

$$\mathrm{ReLU}(x) = xH(x) = \max\{x, 0\} = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \geq 0. \end{cases}$$

It is worth realizing that its derivative is the Heaviside function, $\mathrm{ReLU}'(x) = H(x)$.
The fact that the ReLU activation function does not saturate, as the sigmoid
functions do, has proved useful in recent work on image recognition.
Neural networks with ReLU activation functions tend to learn several times
faster than similar networks with saturating activation functions, such as
logistic or hyperbolic tangent, see [68]. A generalization of the rectified linear
units has been proposed in [48] under the name of maxout.

• Parametric Rectified Linear Unit (PReLU) In this case the activation
is piecewise linear, having different firing rates for $x < 0$ and $x > 0$, see
Fig. 2.3 b:

$$\mathrm{PReLU}(\alpha, x) = \begin{cases} \alpha x, & \text{if } x < 0 \\ x, & \text{if } x \geq 0, \end{cases} \qquad \alpha > 0.$$













Figure 2.4: Hockey-stick functions: a. $y(x) = \mathrm{ELU}(\alpha, x)$ b. $y(x) = \mathrm{SELU}(\alpha, \lambda, x)$.


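Its derivative is piecewise constant, equal to $\alpha$ for $x < 0$ and to $1$ for $x > 0$;
in the limiting case $\alpha = 0$ the function reduces to ReLU, whose derivative is
the Heaviside function $H(x)$.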

• Exponential Linear Units (ELU) This activation function is positive
and linear for $x > 0$, and negative and exponential otherwise, see Fig. 2.4 a:

$$\mathrm{ELU}(\alpha, x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha(e^{x} - 1), & \text{if } x \leq 0. \end{cases}$$

Note that the function is differentiable at $x = 0$ only for the value $\alpha = 1$; in
this case the derivative equals the value 1.

• Scaled Exponential Linear Units (SELU) This activation function is a
scaling of the previous function, see Fig. 2.4 b:

$$\mathrm{SELU}(\alpha, \lambda, x) = \lambda \begin{cases} x, & \text{if } x > 0 \\ \alpha(e^{x} - 1), & \text{if } x \leq 0. \end{cases}$$
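The four hockey-stick functions defined so far translate directly into code. The following is a minimal sketch using only the standard library; the parameter values used in the demonstration are arbitrary and are not recommendations from the text.

```python
import math


def relu(x):
    """ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


def prelu(alpha, x):
    """PReLU: slope alpha for x < 0, slope 1 for x >= 0."""
    return alpha * x if x < 0 else x


def elu(alpha, x):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(alpha, lam, x):
    """SELU: the ELU function scaled by lam."""
    return lam * elu(alpha, x)


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, relu(x), prelu(0.1, x), elu(1.0, x), selu(1.0, 2.0, x))
```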


• Sigmoid Linear Units (SLU) This one is obtained as the product between
a linear function and a sigmoid

$$\phi(x) = x\sigma(x) = \frac{x}{1 + e^{-x}}.$$

Unlike the other activation functions of hockey-stick type, this activation
function is not monotonic, having a minimum, see Fig. 2.5 a. Its derivative
can be computed using the product rule as

$$\phi'(x) = \sigma(x) + x\sigma'(x) = \sigma(x) + x\sigma(x)\big(1 - \sigma(x)\big) = \sigma(x)\big(1 + x\sigma(-x)\big).$$

The critical point is located at $x_0 = -u \approx -1.27$, where $u$ is the positive
solution of the equation $\sigma(u) = \frac{1}{u}$, see Fig. 2.5 b. The SLU might perform
slightly better than the ReLU activation functions in certain cases.
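The location of the minimum can be confirmed numerically. The sketch below finds the zero of $\phi'(x) = \sigma(x)(1 + x\sigma(-x))$ by bisection; the bracketing interval $[-2, -1]$, the tolerance, and the function names are assumptions made for this example.

```python
import math


def sigma(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def slu_prime(x):
    """Derivative of phi(x) = x * sigma(x)."""
    return sigma(x) * (1.0 + x * sigma(-x))


def slu_critical_point(lo=-2.0, hi=-1.0, tol=1e-12):
    """Bisection on phi'; phi' is negative at lo and positive at hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slu_prime(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    x0 = slu_critical_point()
    u = -x0
    # At the critical point, sigma(u) and 1/u agree.
    print(x0, sigma(u), 1.0 / u)
```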












Figure 2.5: a. The activation function $\phi(x) = x\sigma(x)$ has a minimum at the
point $x_0 \approx -1.27$. b. The graphical solution of the equation $\sigma(x) = 1/x$.


This activation function is reminiscent of a profile usually observed in nature,
for instance, in heart beating. Before the heart emits the pressure wave, there
is a slight decrease in the pressure, which corresponds to the dip in Fig. 2.5.
A neuron might exhibit a similar behavior, a fact that may explain the good results
obtained using this activation function.

The size and the location of the dip can be controlled by a parameter
$c > 0$ as

$$\phi_c(x) = x\sigma(cx) = \frac{x}{1 + e^{-cx}},$$

see Fig. 2.6.

• Softplus The softplus function is an increasing positive function that
maps the real line into $(0, \infty)$, which is given by

$$sp(x) = \ln(1 + e^{x}).$$

Its name comes from the fact that it is a smoothed version of the hockey-stick
function $x^{+} = \max\{0, x\}$. Its graph is represented in Fig. 2.7 a. The
decomposition of $x$ into its positive and negative part, via the identity
$x = x^{+} - x^{-}$, where $x^{-} = \max\{0, -x\}$, has in the case of the softplus function the
following analog:

$$sp(x) - sp(-x) = x.$$

This follows from the next algebraic computation that uses properties of the
logarithm

$$sp(x) - sp(-x) = \ln(1 + e^{x}) - \ln(1 + e^{-x}) = \ln\frac{1 + e^{x}}{1 + e^{-x}} = \ln\frac{e^{x}(1 + e^{x})}{e^{x}(1 + e^{-x})} = \ln\frac{e^{x}(1 + e^{x})}{1 + e^{x}} = \ln e^{x} = x.$$

The graph of the function $sp(-x)$ is obtained from the graph of $sp(x)$ by
symmetry with respect to the $y$-axis and represents a softened version of the
function $x^{-}$.
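The identity $sp(x) - sp(-x) = x$ can be checked numerically with a short sketch. The helper name `softplus` and the sample points are chosen for illustration only.

```python
import math


def softplus(x):
    """sp(x) = ln(1 + e^x), computed with log1p for accuracy near zero."""
    return math.log1p(math.exp(x))


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        # The last column reproduces x up to rounding error.
        print(x, softplus(x), softplus(-x), softplus(x) - softplus(-x))
```

For very large positive arguments `math.exp` overflows, so a production implementation would treat that range separately; this sketch does not.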



















Figure 2.6: The activation function $\phi_c(x) = x\sigma(cx) = \frac{x}{1 + e^{-cx}}$ for $c = 1, 2, 3$.

Figure 2.7: a. Softplus function $sp(x) = \ln(1 + e^{x})$ b. Family of logistic functions
approximating the step function.

It is also worth mentioning the following relation with the logistic function:

$$sp'(x) = \frac{1}{1 + e^{-x}} > 0, \tag{2.1.1}$$

which means that softplus is an antiderivative of the logistic function. Since
its derivative is positive, the softplus is increasing and hence invertible, with
the inverse given by

$$sp^{-1}(x) = \ln(e^{x} - 1), \quad x > 0.$$


Sigmoid functions These types of activation functions have the advantage
that they are smooth and can approximate a step function to any degree of
accuracy. In the case when their range is $[0,1]$, their values can be interpreted
as probabilities. We encounter the following types:

• Logistic function with parameter $c > 0$ Also called the soft step function,
this sigmoid function is defined by

$$\sigma_c(x) = \sigma(c, x) = \frac{1}{1 + e^{-cx}},$$



















Figure 2.8: a. Logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$ and its horizontal asymptotes,
$y = 0$ and $y = 1$. b. Hyperbolic tangent function.


where the parameter $c > 0$ controls the firing rate of the neuron, in the sense
that large values of $c$ correspond to a fast change of values from 0 to 1. In
fact, the sigmoid function $\sigma_c(x)$ approximates the Heaviside step function
$H(x)$ as $c \to +\infty$, see Fig. 2.7 b. This follows from the observation that

$$\lim_{c \to \infty} e^{-cx} = \begin{cases} 0, & \text{if } x > 0 \\ \infty, & \text{if } x < 0, \end{cases}$$

which leads to

$$\lim_{c \to \infty} \sigma_c(x) = \lim_{c \to \infty} \frac{1}{1 + e^{-cx}} = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x < 0 \end{cases} = H(x), \quad x \neq 0.$$

At $x = 0$ the value of the sigmoid is independent of $c$, since $\sigma_c(0) = \frac{1}{2}$, and
hence the limit at this point will be equal to $\lim_{c \to \infty} \sigma_c(0) = \frac{1}{2}$.

The logistic function $\sigma_c$ maps monotonically the entire real line into the
unit interval $(0,1)$. Since $\sigma_c(0) = \frac{1}{2}$, the graph is symmetric with respect to
the point $(0, 1/2)$, see Fig. 2.8 a.
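The convergence of $\sigma_c$ to the Heaviside function can be observed numerically. The sketch below evaluates the logistic function in a form that avoids floating-point overflow for large $|cx|$; the function name `logistic` and the sample values of $c$ are choices made for this illustration.

```python
import math


def logistic(c, x):
    """sigma_c(x) = 1 / (1 + exp(-c*x)), evaluated without overflow."""
    z = c * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form exp(z) / (1 + exp(z)).
    ez = math.exp(z)
    return ez / (1.0 + ez)


if __name__ == "__main__":
    for c in (1.0, 10.0, 100.0, 1000.0):
        # Values at x = -0.1, 0, 0.1: the outer ones approach 0 and 1,
        # while the middle one stays at 1/2 for every c.
        print(c, logistic(c, -0.1), logistic(c, 0.0), logistic(c, 0.1))
```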

The sigmoid function satisfies a certain differential equation, which will
be useful later in the backpropagation algorithm. Since its derivative can be
computed using the quotient rule as

$$\sigma_c'(x) = -\frac{(1 + e^{-cx})'}{(1 + e^{-cx})^{2}} = \frac{c\,e^{-cx}}{(1 + e^{-cx})^{2}} = c\left(\frac{1}{1 + e^{-cx}} - \frac{1}{(1 + e^{-cx})^{2}}\right) = c\big(\sigma_c(x) - \sigma_c^{2}(x)\big) = c\,\sigma_c(x)\big(1 - \sigma_c(x)\big),$$

it follows that the rate of change of the sigmoid function $\sigma_c$ can be represented
in terms of $\sigma_c$ as

$$\sigma_c' = c\,\sigma_c(1 - \sigma_c).$$


















It is worth noting that the expression on the right side vanishes for $\sigma_c = 0$
or $\sigma_c = 1$, which correspond to the horizontal asymptotes of the sigmoid.

If $c = 1$, we obtain the standard logistic function, which will be used in the
construction of the sigmoid neuron

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

A straightforward algebraic manipulation shows the following symmetry property
of the sigmoid:

$$\sigma_c(-x) = 1 - \sigma_c(x).$$

It is worth noting that the logistic function can be generalized to a 2-parameter
sigmoid function by the relation

$$\sigma_{c,\alpha}(x) = \frac{1}{(1 + e^{-cx})^{\alpha}}, \quad c, \alpha > 0,$$

to allow for more sophisticated models.

Sometimes the inverse of the sigmoid function is needed. It is called the
logit function and its expression is given by

$$\sigma^{-1}(x) = \log\left(\frac{x}{1 - x}\right), \quad x \in (0,1).$$


Note that the softplus function can be recovered from the logistic function as

$$sp(x) = \int_{-\infty}^{x} \sigma(t)\, dt. \tag{2.1.2}$$

This relation can be verified directly by integration, or by checking the following
antiderivative conditions:

$$\lim_{x \to -\infty} sp(x) = \lim_{x \to -\infty} \ln(1 + e^{x}) = \ln 1 = 0,$$

$$sp'(x) = \big(\ln(1 + e^{x})\big)' = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}} = \sigma(x).$$


• Hyperbolic tangent This is sometimes called the bipolar sigmoidal function.
It maps the real number line into the interval $(-1,1)$, having horizontal
asymptotes at $y = \pm 1$, see Fig. 2.8 b. The hyperbolic tangent is defined by

$$t(x) = \tanh x = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

An algebraic computation shows the following linear relation with the logistic
function:

$$t(x) = 2\sigma_2(x) - 1.$$












Figure 2.9: a. Arctangent function b. Piecewise linear function with parameter $\alpha$.

Differentiating yields

$$t'(x) = 2\sigma_2'(x) = 4\sigma_2(x)\big(1 - \sigma_2(x)\big) = \big(1 + t(x)\big)\big(1 - t(x)\big) = 1 - t^{2}(x),$$

so we arrive at the well-known differential equation

$$t'(x) = 1 - t^{2}(x),$$

which will be useful in the backpropagation algorithm. We also make the
remark that $t(0) = 0$ and the graph of $t(x)$ is symmetric with respect to the
origin. The fact that $t(x)$ is centered at the origin constitutes an advantage
of using hyperbolic tangent over the logistic function $\sigma(x)$.
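Both the linear relation $\tanh x = 2\sigma_2(x) - 1$ and the differential equation $t' = 1 - t^2$ can be checked numerically. The sketch below compares a central finite difference with the right-hand side; the step size `h` and the function names are assumptions of this example.

```python
import math


def logistic(c, x):
    """sigma_c(x) = 1 / (1 + exp(-c*x)); adequate for moderate |c*x|."""
    return 1.0 / (1.0 + math.exp(-c * x))


def tanh_from_logistic(x):
    """Hyperbolic tangent expressed through the logistic function with c = 2."""
    return 2.0 * logistic(2.0, x) - 1.0


def tanh_derivative_fd(x, h=1e-5):
    """Central finite-difference approximation of t'(x)."""
    return (math.tanh(x + h) - math.tanh(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        t = math.tanh(x)
        print(x, t, tanh_from_logistic(x), tanh_derivative_fd(x), 1.0 - t * t)
```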

• Arctangent function Another sigmoid function, which maps the entire
line $(-\infty, \infty)$ into $(-1,1)$, is the following arctangent function, see Fig. 2.9 a:

$$h(x) = \frac{2}{\pi}\tan^{-1}(x), \quad x \in \mathbb{R}.$$

• Softsign function Also known as Hahn’s function, softsign is an increasing
sigmoid function, mapping the entire real line into the interval $(-1,1)$. Its
defining relation is given by

$$so(x) = \frac{x}{1 + |x|}, \quad x \in \mathbb{R}.$$

Despite the absolute value term, the function is differentiable; however, it is
not twice differentiable at $x = 0$. In spite of its similarities with the hyperbolic
tangent, softsign has tails that are quadratic polynomials rather than
exponential, and hence it approaches its asymptotes at a slower rate. Softsign
was proposed as an activation function by Bergstra in 2009, see [16]. It has

















been shown that the advantage of using softsign over the hyperbolic tangent 
is the fact that it saturates slower than the latter, see [44]. 

• Piecewise Linear This is a sigmoid function, depending on a parameter
$\alpha > 0$, which is defined as

$$f_\alpha(x) = f(\alpha, x) = \begin{cases} -1, & \text{if } x \leq -\alpha \\ x/\alpha, & \text{if } -\alpha < x < \alpha \\ 1, & \text{if } x \geq \alpha. \end{cases}$$

Note that as $\alpha \to 0$ then $f(\alpha, x) \to S(x)$, for $x \neq 0$, and that the graph
always has two corner points, see Fig. 2.9 b.

Most aforementioned functions have similar sigmoid graphs, with two
horizontal asymptotes, as $x \to \pm\infty$. However, the functions distinguish themselves
by the derivative at $x = 0$ as in the following:

$$\sigma_c'(0) = \frac{c}{4}, \qquad t'(0) = 1, \qquad h'(0) = \frac{2}{\pi}, \qquad so'(0) = 1, \qquad f_\alpha'(0) = \frac{1}{\alpha}.$$

Roughly speaking, the larger the derivative at zero, the closer to a step function
the sigmoid function is.
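The slopes at the origin can be estimated with a central finite difference. The sketch below does this for the logistic function $\sigma_c$, the hyperbolic tangent, the arctangent function $h(x) = \frac{2}{\pi}\arctan x$, softsign, and the piecewise linear function; the sample parameters $c = 3$ and $\alpha = 0.5$, the step size, and the function names are assumptions of this example.

```python
import math


def logistic(c, x):
    return 1.0 / (1.0 + math.exp(-c * x))


def arctan_sigmoid(x):
    return (2.0 / math.pi) * math.atan(x)


def softsign(x):
    return x / (1.0 + abs(x))


def piecewise_linear(alpha, x):
    if x <= -alpha:
        return -1.0
    if x >= alpha:
        return 1.0
    return x / alpha


def slope_at_zero(f, h=1e-6):
    """Central finite-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)


if __name__ == "__main__":
    c, alpha = 3.0, 0.5
    print("logistic  ", slope_at_zero(lambda x: logistic(c, x)), c / 4.0)
    print("tanh      ", slope_at_zero(math.tanh), 1.0)
    print("arctangent", slope_at_zero(arctan_sigmoid), 2.0 / math.pi)
    print("softsign  ", slope_at_zero(softsign), 1.0)
    print("piecewise ", slope_at_zero(lambda x: piecewise_linear(alpha, x)), 1.0 / alpha)
```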

The logistic function $\sigma(x)$ encountered before satisfies the following three
properties:

1. nondecreasing, differentiable, with $\sigma'(x) \geq 0$;

2. has horizontal asymptote for $x \to -\infty$ with value $\sigma(-\infty) = 0$;

3. has horizontal asymptote for $x \to +\infty$ with value $\sigma(+\infty) = 1$.

These properties are typical for a cumulative distribution function of a random
variable that takes values in $(-\infty, +\infty)$. Consequently, the integral
$\sigma(x) = \int_{-\infty}^{x} p(u)\, du$ of any continuous, nonnegative, bump-shaped function $p(x)$
with unit total integral is a sigmoidal function satisfying the previous properties. For instance, if we
take $p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^{2}/2}$ then $F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\, dt$ is a sigmoid function.

Bumped-type functions These types of activation functions are used when
the neuron gets activated to a maximum for a certain value of the action
potential; it can occur in a smooth bumped way, see Fig. 2.10 a, or with a
bump of a thorn type, see Fig. 2.10 b. These are given by

• Gaussian It maps the real line into $(0,1]$ and is given by

$$g(x) = e^{-x^{2}}, \quad x \in \mathbb{R}.$$

• Double exponential It maps the real line into $(0,1]$ and is defined as

$$f(x) = e^{-\lambda |x|}, \quad x \in \mathbb{R}, \ \lambda > 0.$$







Figure 2.10: a. Gaussian function $g(x) = e^{-x^{2}}$ b. Double exponential function
$f(x) = e^{-\lambda |x|}$.


Classification functions A one-hot vector with $n$ components is a coordinate
vector in $\mathbb{R}^n$ of the form

$$e_i = (0, \dots, 0, 1, 0, \dots, 0)$$

with a 1 on the $i$th slot and zeros elsewhere. These vectors are useful in classification
problems. If one would like to have the network output as close as
possible to this type, then the activation function of the last layer should be
a softmax function. If $x \in \mathbb{R}^n$ is a vector, then define the $n$-valued function
$\mathrm{softmax}(x) = y$, with

$$y_j = \frac{e^{x_j}}{\|e^{x}\|}, \quad j = 1, \dots, n.$$

Usually, the $L^1$-norm is considered, $\|e^{x}\| = \sum_{i=1}^{n} e^{x_i}$, but other norms
can also be applied.

We can also define the softmax function depending on a parameter $c > 0$
by

$$\mathrm{softmax}_c(x) = \left( \frac{e^{c x_1}}{\sum_{i=1}^{n} e^{c x_i}}, \dots, \frac{e^{c x_n}}{\sum_{i=1}^{n} e^{c x_i}} \right).$$

We note that the usual softmax function is obtained for $c = 1$. However, if
$c \to \infty$, we obtain

$$\lim_{c \to \infty} \mathrm{softmax}_c(x) = (0, \dots, 1, \dots, 0) = e_k,$$

where $x_k = \max\{x_1, \dots, x_n\}$. We shall verify this for the case $n = 2$ only.
The general case can be treated similarly. Assume $x_1 > x_2$. Then

$$\lim_{c \to \infty} \mathrm{softmax}_c(x) = \lim_{c \to \infty} \left( \frac{e^{c x_1}}{e^{c x_1} + e^{c x_2}}, \frac{e^{c x_2}}{e^{c x_1} + e^{c x_2}} \right) = \lim_{c \to \infty} \left( \frac{1}{1 + e^{c(x_2 - x_1)}}, \frac{1}{e^{c(x_1 - x_2)} + 1} \right) = (1, 0) = e_1.$$





Hence, the softmax function is a smooth softened version of the maximum 
function. 
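The limiting behavior of the parametrized softmax can be observed with a short sketch. It subtracts the largest component before exponentiating, a standard device that leaves the quotient unchanged while avoiding overflow; the function name `softmax`, its signature, and the sample vector are assumptions of this example.

```python
import math


def softmax(x, c=1.0):
    """softmax_c(x) with the L1 normalization, computed in a stable way."""
    m = max(x)
    # exp(c*(x_i - m)) / sum_j exp(c*(x_j - m)) equals exp(c*x_i) / sum_j exp(c*x_j).
    exps = [math.exp(c * (xi - m)) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    x = [1.0, 3.0, 2.0]
    for c in (1.0, 5.0, 50.0, 200.0):
        # As c grows, the output approaches the one-hot vector e_2.
        print(c, softmax(x, c))
```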

In the following we shall present two generic classes of activation func- 
tions: sigmoidal and squashing functions. These notions are more advanced 
and can be omitted at a first reading. They will be useful later when dealing 
with neural nets as function approximators. The reader can find the measure 
theory background needed for the next section in sections C.3 and C.4 of the 
Appendix. 

2.2 Sigmoidal Functions 

This section introduces and studies the properties of a useful family of activa¬ 
tion functions. The results of this section will be useful in the approximation 
theorems presented in Chapter 9. 

Definition 2.2.1 A function $\sigma : \mathbb{R} \to [0,1]$ is called sigmoidal if

$$\lim_{x \to -\infty} \sigma(x) = 0, \qquad \lim_{x \to +\infty} \sigma(x) = 1.$$

We note that the previous definition does not require monotonicity; however,
it states that it suffices to have two horizontal asymptotes. The prototype
example for sigmoidal functions is the logistic function.

Recall that a measure $\mu$ can be regarded as a system of beliefs used to assess
information, so $d\mu(x) = \mu(dx)$ represents an evaluation of the information
cast into $x$. Consequently, the integral $\int f(x)\, d\mu(x)$ represents the evaluation
of the function $f(x)$ under the system of beliefs $\mu$.

The next definition introduces the notion of discriminatory function. This
is characterized by the following property: if the evaluation of the neuron
output, $\sigma(w^{T}x + \theta)$, over all possible inputs $x$, under the belief $\mu$ vanishes
for any threshold $\theta$ and any system of weights, $w$, then $\mu$ must vanish, i.e.,
it is a void belief.

The next definition uses the concept of a signed Baire measure. The 
reader can peek in sections C.3 and C.9 of the Appendix to refresh the defi¬ 
nition of Baire measures and signed measures. The main reason for using the 
Baire measure concept is its good behavior with respect to compactly sup- 
ported functions (all compactly supported continuous functions are Baire- 
measurable and any compactly supported continuous function on a Baire 
space is integrable with respect to any finite Baire measure). 

More precisely, if we denote by $I_n = [0,1]^n = [0,1] \times \cdots \times [0,1]$ the $n$-dimensional
unit cube in $\mathbb{R}^n$ and by $M(I_n)$ the space of finite, signed regular
Baire measures on $I_n$, we have:




Definition 2.2.2 Let $\mu \in M(I_n)$. A function $\sigma$ is called discriminatory for
the measure $\mu$ if:

$$\int_{I_n} \sigma(w^{T}x + \theta)\, d\mu(x) = 0, \quad \forall w \in \mathbb{R}^n, \forall \theta \in \mathbb{R} \implies \mu = 0.$$

We note that the function $\sigma$ in the previous definition is not necessarily a
logistic sigmoid function, but any function satisfying the required property.

We shall denote by $\mathcal{P}_{w,\theta} = \{x;\ w^{T}x + \theta = 0\}$ the hyperplane with normal
vector $w$ and $(n+1)$-intercept $\theta$. Consider also the open half-spaces

$$\mathcal{H}^{+}_{w,\theta} = \{x;\ w^{T}x + \theta > 0\}$$

$$\mathcal{H}^{-}_{w,\theta} = \{x;\ w^{T}x + \theta < 0\},$$

which form a partition of the space as $\mathbb{R}^n = \mathcal{H}^{+}_{w,\theta} \cup \mathcal{H}^{-}_{w,\theta} \cup \mathcal{P}_{w,\theta}$. The following
lemma is useful whenever we want to show that a function $\sigma$ is discriminatory.


Lemma 2.2.3 Let $\mu \in M(I_n)$. If $\mu$ vanishes on all hyperplanes and open
half-spaces in $\mathbb{R}^n$, then $\mu$ is zero. More precisely, if

$$\mu(\mathcal{H}^{+}_{w,\theta}) = 0, \quad \mu(\mathcal{P}_{w,\theta}) = 0, \quad \forall w \in \mathbb{R}^n, \theta \in \mathbb{R},$$

then $\mu = 0$.


Proof: Let $w \in \mathbb{R}^n$ be fixed. Consider the linear functional $F : L^{\infty}(\mathbb{R}) \to \mathbb{R}$
defined by

$$F(h) = \int_{I_n} h(w^{T}x)\, d\mu(x),$$

where $L^{\infty}(\mathbb{R})$ denotes the space of almost everywhere bounded functions on
$\mathbb{R}$.

The fact that $F$ is a bounded functional follows from the inequality

$$|F(h)| = \left| \int_{I_n} h(w^{T}x)\, d\mu(x) \right| \leq |\mu|(I_n)\, \|h\|_{\infty},$$

where we used that $\mu$ is a finite measure.

Consider now $h = 1_{[\theta, \infty)}$, i.e., $h$ is the indicator function of the interval
$[\theta, \infty)$. Then

$$F(h) = \int_{I_n} h(w^{T}x)\, d\mu(x) = \int_{\{w^{T}x \geq \theta\}} d\mu(x) = \mu(\mathcal{P}_{w,-\theta}) + \mu(\mathcal{H}^{+}_{w,-\theta}) = 0.$$

J I n J {w T X>6} 





If $h$ is the indicator function of the open interval $(\theta, \infty)$, i.e., $h = 1_{(\theta, \infty)}$, a
similar computation shows

$$F(h) = \int_{I_n} h(w^{T}x)\, d\mu(x) = \int_{\{w^{T}x > \theta\}} d\mu(x) = \mu(\mathcal{H}^{+}_{w,-\theta}) = 0.$$

Using that the indicator of any interval can be written in terms of the aforementioned
indicator functions, i.e.,

$$1_{[a,b]} = 1_{[a,\infty)} - 1_{(b,\infty)}, \qquad 1_{(a,b)} = 1_{(a,\infty)} - 1_{[b,\infty)}, \quad \text{etc.},$$

it follows from linearity that $F$ vanishes on any indicator function. Applying
the linearity again, we obtain that $F$ vanishes on simple functions

$$F\left( \sum_{i=1}^{N} \alpha_i 1_{J_i} \right) = \sum_{i=1}^{N} \alpha_i F(1_{J_i}) = 0,$$

for any $\alpha_i \in \mathbb{R}$ and $J_i$ intervals. Since simple functions are dense in $L^{\infty}(\mathbb{R})$,
it follows that $F = 0$. More precisely, for any fixed $f \in L^{\infty}(\mathbb{R})$, by density
reasons, there is a sequence of simple functions $(s_n)_n$ such that $s_n \to f$, as
$n \to \infty$. Since $F$ is bounded, it is continuous, and then we have

$$F(f) = F(\lim_{n} s_n) = \lim_{n} F(s_n) = 0.$$

Now we compute the Fourier transform of the measure $\mu$ as

$$\hat{\mu}(w) = \int_{I_n} e^{i w^{T}x}\, d\mu(x) = \int_{I_n} \cos(w^{T}x)\, d\mu(x) + i \int_{I_n} \sin(w^{T}x)\, d\mu(x) = F(\cos(\cdot)) + iF(\sin(\cdot)) = 0, \quad \forall w \in \mathbb{R}^n,$$

since $F$ vanishes on the bounded functions sine and cosine. From the injectivity
of the Fourier transform, it follows that $\mu = 0$. The reader can find
the definition of the Fourier transform in section D.6.4 of the Appendix.

■

■ 

The next result presents a class of discriminatory functions, see [30].

Proposition 2.2.4 Any continuous sigmoidal function is discriminatory for all measures $\mu \in M(I_n)$.

Proof: Let $\mu \in M(I_n)$ be a fixed measure. Choose a continuous sigmoidal $\sigma$ that satisfies

$$\int_{I_n} \sigma(w^T x + \theta)\, d\mu(x) = 0, \quad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R}. \tag{2.2.3}$$




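We need to show that $\mu = 0$. First, construct the continuous function

$$\sigma_\lambda(x) = \sigma\big(\lambda(w^T x + \theta) + \phi\big)$$

for given $w$, $\theta$ and $\phi$, and use the definition of a sigmoidal to note that

$$\lim_{\lambda \to \infty} \sigma_\lambda(x) = \begin{cases} 1, & \text{if } w^T x + \theta > 0 \\ 0, & \text{if } w^T x + \theta < 0 \\ \sigma(\phi), & \text{if } w^T x + \theta = 0. \end{cases}$$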

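Define the bounded function

$$\gamma(x) = \begin{cases} 1, & \text{if } x \in \mathcal{H}^+_{w,\theta} \\ 0, & \text{if } x \in \mathcal{H}^-_{w,\theta} \\ \sigma(\phi), & \text{if } x \in \mathcal{P}_{w,\theta} \end{cases}$$

and notice that $\sigma_\lambda(x) \to \gamma(x)$ pointwise on $\mathbb{R}^n$, as $\lambda \to \infty$. The Bounded Convergence Theorem allows switching the limit with the integral, obtaining

$$\lim_{\lambda \to \infty} \int_{I_n} \sigma_\lambda(x)\, d\mu(x) = \int_{I_n} \gamma(x)\, d\mu(x) = \int_{\mathcal{H}^+_{w,\theta}} \gamma(x)\, d\mu(x) + \int_{\mathcal{H}^-_{w,\theta}} \gamma(x)\, d\mu(x) + \int_{\mathcal{P}_{w,\theta}} \gamma(x)\, d\mu(x).$$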

Equation (2.2.3) implies that $\int_{I_n} \sigma_\lambda(x)\, d\mu(x) = 0$, and hence the limit in the previous left term vanishes. Consequently, the right term must also vanish, a fact that can be written as

$$\mu(\mathcal{H}^+_{w,\theta}) + \sigma(\phi)\, \mu(\mathcal{P}_{w,\theta}) = 0.$$

Since this relation holds for any value of $\phi$, taking $\phi \to +\infty$ and using the properties of $\sigma$ yields

$$\mu(\mathcal{H}^+_{w,\theta}) + \mu(\mathcal{P}_{w,\theta}) = 0.$$

Similarly, taking $\phi \to -\infty$ implies

$$\mu(\mathcal{H}^+_{w,\theta}) = 0, \quad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R}. \tag{2.2.4}$$

Note that, as a consequence of the last two relations, we also have $\mu(\mathcal{P}_{w,\theta}) = 0$. Since $\mathcal{H}^+_{w,\theta} = \mathcal{H}^-_{-w,-\theta}$, relation (2.2.4) states that the measure $\mu$ vanishes on all half-spaces of $\mathbb{R}^n$. Lemma 2.2.3 states that a measure with such a property is necessarily the zero measure, $\mu = 0$. Therefore, $\sigma$ is discriminatory. ■


Remark 2.2.5 1. The conclusion still holds true if in the hypothesis of the previous proposition we replace “continuous” by “bounded measurable”.

2. A discriminatory function is not necessarily monotonic.




Figure 2.11: Squashing functions: a. Ramp function b. Cosine squasher 
function. 

2.3 Squashing Functions 

Another slightly different type of activation function used in the universal 
approximators of Chapter 9 is given in the following. 

Definition 2.3.1 A function $\varphi : \mathbb{R} \to [0,1]$ is a squashing function if:

(i) it is nondecreasing, i.e., for $x_1, x_2 \in \mathbb{R}$ with $x_1 < x_2$, we have $\varphi(x_1) \le \varphi(x_2)$;

(ii) $\lim_{x \to -\infty} \varphi(x) = 0$, $\lim_{x \to \infty} \varphi(x) = 1$.

A few examples of squashing functions are: 

1. The step function $H(x) = 1_{[0,\infty)}(x)$.

2. The ramp function $\varphi(x) = x\, 1_{[0,1]}(x) + 1_{(1,\infty)}(x)$, see Fig. 2.11 a.

3. All monotonic sigmoidal functions are squashing functions, the prototype 
being the logistic function. 

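4. The cosine squasher of Gallant and White (1988)

$$\varphi(x) = \frac{1}{2}\Big(1 + \cos\big(x + \tfrac{3\pi}{2}\big)\Big)\, 1_{[-\frac{\pi}{2}, \frac{\pi}{2}]}(x) + 1_{(\frac{\pi}{2}, \infty)}(x).$$

In the previous relation, the first term provides a function which is nonzero on $(-\frac{\pi}{2}, \frac{\pi}{2})$, while the second term is a step function equal to 1 on the right side of $\frac{\pi}{2}$. Their sum is a continuous function, see Fig. 2.11 b.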
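The examples above can be explored numerically. The following is a minimal Python sketch, not taken from the text: the helper `looks_like_squashing` is a hypothetical name, the cosine squasher is assumed to have the usual form $\frac{1}{2}(1 + \cos(x + \frac{3\pi}{2}))$ on $[-\frac{\pi}{2}, \frac{\pi}{2}]$, and the check is done on a finite grid, so it illustrates the definition without proving anything.

```python
import math

def step(x):
    # H(x) = 1 on [0, inf), 0 otherwise
    return 1.0 if x >= 0 else 0.0

def ramp(x):
    # equals x on [0, 1], 0 to the left and 1 to the right
    return min(1.0, max(0.0, x))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def cosine_squasher(x):
    # assumed form: (1 + cos(x + 3*pi/2)) / 2 on [-pi/2, pi/2]
    if x < -math.pi / 2:
        return 0.0
    if x > math.pi / 2:
        return 1.0
    return 0.5 * (1.0 + math.cos(x + 1.5 * math.pi))

def looks_like_squashing(phi, lo=-20.0, hi=20.0, steps=4001, tol=1e-6):
    """Grid check (not a proof): nondecreasing, near 0 at lo, near 1 at hi."""
    xs = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    ys = [phi(x) for x in xs]
    nondecreasing = all(b >= a - 1e-12 for a, b in zip(ys, ys[1:]))
    return nondecreasing and abs(ys[0]) < tol and abs(ys[-1] - 1.0) < tol

for name, phi in [("step", step), ("ramp", ramp),
                  ("logistic", logistic), ("cosine", cosine_squasher)]:
    print(name, looks_like_squashing(phi))
```

A non-monotonic function such as `math.sin` fails the same check, which is the behavior one expects from the definition.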

Being monotonic, squashing functions have at most a countable number of discontinuities, which are of jump type. In particular, squashing functions are measurable with respect to the usual Borel $\sigma$-algebra.

The next result states that we can always select a convergent sequence of squashing functions.

Lemma 2.3.2 Let $S$ be an infinite family of squashing functions. Then we can choose a sequence $(\varphi_n)_n$ in $S$ such that $\varphi_n$ converges at each point $x$ in $[0,1]$. The limit is also a squashing function.










Proof: The family $S$ is uniformly bounded with $\sup_{x \in [a,b]} |\varphi(x)| \le 1$ for all $\varphi \in S$. Since squashing functions are nondecreasing, they have finite total variation given by $V_{-\infty}^{\infty}(\varphi) = \varphi(\infty) - \varphi(-\infty) = 1$, for all $\varphi \in S$. Applying Helly's theorem, see Theorem E.5.4 in the Appendix, there is a sequence $(\varphi_n)_n$ of functions in $S$ that is convergent at each point $x \in [a,b]$. Consider the limit function $\varphi(x) = \lim_{n \to \infty} \varphi_n(x)$ and show that $\varphi$ satisfies the properties of a squashing function.

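Let $x_1 < x_2$. Then $\varphi_n(x_1) \le \varphi_n(x_2)$ for all $n \ge 1$. Taking the limit yields

$$\varphi(x_1) = \lim_{n \to \infty} \varphi_n(x_1) \le \lim_{n \to \infty} \varphi_n(x_2) = \varphi(x_2),$$

which means that $\varphi$ is nondecreasing. Interchanging limits, we also have

$$\varphi(-\infty) = \lim_{x \to -\infty} \varphi(x) = \lim_{x \to -\infty} \lim_{n \to \infty} \varphi_n(x) = \lim_{n \to \infty} \lim_{x \to -\infty} \varphi_n(x) = 0$$

$$\varphi(+\infty) = \lim_{x \to +\infty} \varphi(x) = \lim_{x \to +\infty} \lim_{n \to \infty} \varphi_n(x) = \lim_{n \to \infty} \lim_{x \to +\infty} \varphi_n(x) = 1,$$

where we used that $\varphi_n$ are squashing functions. ■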

The next consequence states that the set of squashing functions is sequentially compact:

Corollary 2.3.3 Any infinite sequence of squashing functions $(\varphi_n)_n$ contains a convergent subsequence $(\varphi_{n_k})_k$.


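Example 2.3.1 Consider the family of squashing functions

$$S = \Big\{ \frac{1}{1 + e^{-cx}} ;\ c \in \mathbb{R}_+ \Big\}.$$

Then the sequence $(\varphi_n)_n$, $\varphi_n(x) = \frac{1}{1 + e^{-nx}}$, converges pointwise to the squashing function $H(x) = 1_{[0,\infty)}(x)$ at every $x \ne 0$; at $x = 0$ we have $\varphi_n(0) = \frac{1}{2}$ for all $n$, so the limit function takes the value $\frac{1}{2}$ there.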
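The convergence in this example is easy to observe numerically. The sketch below is only an illustration; the function name `phi` and the chosen sample points are our own, and the logistic is evaluated in an overflow-safe way.

```python
import math

def phi(n, x):
    """phi_n(x) = 1 / (1 + exp(-n x)), written to avoid overflow for large |n x|."""
    t = n * x
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

# Each row shows phi_n(x) for growing n at a fixed point x.
for x in (-0.5, -0.01, 0.0, 0.01, 0.5):
    print(x, [round(phi(n, x), 4) for n in (1, 10, 100, 1000, 10000)])
```

For negative points the values drift to 0, for positive points to 1, while at the origin every term of the sequence equals 1/2.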


2.4 Summary 

Each unit of a neural net uses a function that processes the incoming signal, 
called the activation function. The importance of activation functions is to 
introduce nonlinearities into the network. Linear activations can produce 
only linear outputs and can be used only for linear classification or linear 
regression problems, from this point of view being very restrictive. Nonlinear 
activation functions fix this problem, the network being able to deal with
more complex decision boundaries or to approximate nonlinear functions. 

There are many types of activation functions that have been proposed 
over time for different purposes, having diverse shapes, such as step functions, 








linear, sigmoid, hockey stick, bumped-type, etc. The threshold step function 
was used as an activation function for the McCulloch-Pitts neuron in their 
1943 paper [82]. Due to their serious limitations, including the lack of good training algorithms, discontinuous activation functions are not used that often, and therefore, nonlinear activation functions have been proposed.

The most common nonlinear activation functions used for neural networks are the standard logistic sigmoid and the hyperbolic tangent. Their differentiability and the fact that their derivatives can be represented in terms of the functions themselves make these activation functions useful when applying the backpropagation algorithm. It has been noticed that when training multiple-layer neural networks the hyperbolic tangent provides better results than the logistic function.
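The remark about derivatives refers to the standard identities $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$. A short Python sketch, with function names chosen here only for illustration, compares them against a central finite difference:

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigma(x):
    # derivative written in terms of the function value itself
    s = sigma(x)
    return s * (1.0 - s)

def dtanh(x):
    t = math.tanh(x)
    return 1.0 - t * t

def central_diff(f, x, h=1e-5):
    # numerical derivative, used only as an independent check
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, -0.3, 0.0, 1.5):
    print(x, dsigma(x) - central_diff(sigma, x), dtanh(x) - central_diff(math.tanh, x))
```

In backpropagation this means that the activation computed in the forward pass can be reused to obtain the derivative, with no additional evaluation of the exponential.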

The idea of using a ReLU as an activation function was biologically inspired 
by the observation that a cortical neuron is rarely in its maximum saturation 
regime and its activation increases proportionally with the signal, see [19] 
and [32]. The ReLU is, in this case, an approximation of the sigmoid. It has been shown that rectified linear units, despite their non-differentiability at zero, do a better job than regular sigmoid activation functions for image-related tasks, see [90] and [45].

Other activation functions (sigmoidal activations and squashing functions) were included for the theoretical purposes of the universal approximator theorems of Chapter 9.

2.5 Exercises 

Exercise 2.5.1 (a) Show that the logistic function $\sigma$ satisfies the inequality $0 < \sigma'(x) \le \frac{1}{4}$ for all $x \in \mathbb{R}$.

(b) How does the inequality change in the case of the function $\sigma_c$?

Exercise 2.5.2 Let $S(x)$ and $H(x)$ denote the bipolar step function and the Heaviside function, respectively. Show that

(a) $S(x) = 2H(x) - 1$;

(b) $ReLU(x) = \frac{1}{2} x (S(x) + 1)$.

Exercise 2.5.3 Show that the softplus function, $sp(x)$, satisfies the following properties:

(a) $sp'(x) = \sigma(x)$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$;

(b) Show that $sp(x)$ is invertible with the inverse $sp^{-1}(x) = \ln(e^x - 1)$;

(c) Use the softplus function to show the formula $\sigma(x) = 1 - \sigma(-x)$.





Exercise 2.5.4 Show that $\tanh(x) = 2\sigma(2x) - 1$.

Exercise 2.5.5 Show that the softsign function, $so(x)$, satisfies the following properties:

(a) It is strictly increasing;

(b) It is onto $(-1,1)$, with the inverse $so^{-1}(x) = \frac{x}{1 - |x|}$, for $|x| < 1$;

(c) $so(|x|)$ is subadditive, i.e., $so(|x + y|) \le so(|x|) + so(|y|)$.

Exercise 2.5.6 Show that the softmax function is invariant with respect to addition of constant vectors $c = (c, \ldots, c)^T$, i.e.,

$$softmax(y + c) = softmax(y).$$

This property is used in practice by taking $c = -\max_i y_i$, a fact that leads to a numerically more stable variant of this function.
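The stabilization mentioned in Exercise 2.5.6 can be sketched in a few lines of Python. This is an illustrative implementation under the usual definition $softmax(y)_i = e^{y_i} / \sum_j e^{y_j}$; it does not replace the proof asked for in the exercise.

```python
import math

def softmax(y):
    # Shift by c = -max(y); by the invariance property the result is unchanged,
    # but every exponent is now <= 0, so math.exp cannot overflow.
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))
print(softmax([1001.0, 1002.0, 1003.0]))  # same output; a naive exp(1003.0) would overflow
```

Both calls print the same probabilities, since the second input is the first one shifted by the constant vector with entries 1000.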

Exercise 2.5.7 Let $\rho : \mathbb{R}^n \to \mathbb{R}^n$ be defined by $\rho(y) \in \mathbb{R}^n$, with $\rho(y)_i = \frac{y_i^2}{\|y\|_2^2}$.

Show that

(a) $0 \le \rho(y)_i \le 1$ and $\sum_i \rho(y)_i = 1$;

(b) The function $\rho$ is invariant with respect to multiplication by nonzero constants, i.e., $\rho(\lambda y) = \rho(y)$ for any $\lambda \in \mathbb{R} \setminus \{0\}$. Taking $\lambda = \frac{1}{\max_i y_i}$ leads in practice to a more stable variant of this function.


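Exercise 2.5.8 (cosine squasher) Show that the function

$$\varphi(x) = \frac{1}{2}\Big(1 + \cos\big(x + \tfrac{3\pi}{2}\big)\Big)\, 1_{[-\frac{\pi}{2}, \frac{\pi}{2}]}(x) + 1_{(\frac{\pi}{2}, \infty)}(x)$$

is a squashing function.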

Exercise 2.5.9 (a) Show that any squashing function is a sigmoidal function; 
( b ) Give an example of a sigmoidal function which is not a squashing function. 





Chapter 3 

Cost Functions 


In the learning process, the parameters of a neural network are adjusted to minimize a certain objective function, which represents a measure of proximity between the prediction of the network and the associated target. This is also known under the equivalent names of cost function, loss function, or error function. In the following we shall describe some of the most familiar cost functions used in neural networks.


3.1 Input, Output, and Target 

The input of a neural network is obtained from given data or from sensors 
that perceive the environment. The input is a variable that is fed into the 
network. It can be a one-dimensional variable $x \in \mathbb{R}$, or a vector $\mathbf{x} \in \mathbb{R}^n$, a matrix, or a random variable $X$. In general, the input can be a tensor, see Section B of the Appendix.

The network acts as a function, i.e., it modifies the input variable in a certain way and provides an output, which can be, again, one-dimensional, $y \in \mathbb{R}$, or a vector, $\mathbf{y} \in \mathbb{R}^n$, or a random variable, $Y$, or a tensor. The law by which the input is transformed into the output is given by the input-output mapping, $f_{w,b}$. The index $(w, b)$ indicates that the internal network parameters, while making this assignment, are set to $(w, b)$. Following the previous notations, we may have either $y = f_{w,b}(x)$, or $\mathbf{y} = f_{w,b}(\mathbf{x})$, or $Y = f_{w,b}(X)$.

The target function is the desired relation which the network tries to approximate. This function is independent of the parameters $w$ and $b$ and will be denoted either by $z = \phi(x)$, in the one-dimensional case, or by $\mathbf{z} = \phi(\mathbf{x})$, in the vector case, or $Z = \phi(X)$, for random variables. Also some mixtures


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_3 




may occur, for instance, $z = \phi(\mathbf{x})$, in the case when the input is a vector and the output is one-dimensional.

The neural network will tune the parameters $(w, b)$ until the output variable $y$ is in the proximity of the target variable $z$. This proximity can be measured in a variety of ways, involving different types of cost functions, $C(w, b) = dist(y, z)$, which are parameterized by $(w, b)$, with the distance to be specified in the following. The optimal parameters of the network are given by $(w^*, b^*) = \arg\min_{w,b} C(w, b)$. The process by which the parameters $(w, b)$ are tuned into $(w^*, b^*)$ is called learning. Equivalently, we also say that the network learns the target function $\phi(x)$. This learning process involves the minimization of a cost function. We shall investigate next a few types of cost functions.


3.2 The Supremum Error Function 


Assume that a neural network, which takes inputs $x \in [0,1]$, is supposed to learn a given continuous function $\phi : [0,1] \to \mathbb{R}$. If $f_{w,b}$ is the input-output mapping of the network, then the associated cost function is

$$C(w, b) = \sup_{x \in [0,1]} |f_{w,b}(x) - \phi(x)|.$$

For all practical purposes, when the target function is known at $n$ points

$$z_1 = \phi(x_1),\ z_2 = \phi(x_2),\ \ldots,\ z_n = \phi(x_n),$$

the aforementioned cost function becomes

$$C(w, b) = \max_{1 \le i \le n} |f_{w,b}(x_i) - z_i|.$$
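As an illustration of the discrete supremum error, the Python sketch below evaluates the maximum of $|f_{w,b}(x_i) - z_i|$ over sample points for a hypothetical one-parameter-pair model $f_{w,b}(x) = wx + b$ and the target $\phi(x) = x^2$. Both the model and the target are assumptions made only for this example.

```python
def sup_error(f, xs, zs):
    # discrete supremum (maximum) error over the sample points
    return max(abs(f(x) - z) for x, z in zip(xs, zs))

# Hypothetical setting: target phi(x) = x^2 sampled on a grid of [0, 1],
# and a linear "network" f(x) = w*x + b with w = 1, b = -1/8.
xs = [k / 100 for k in range(101)]
zs = [x * x for x in xs]
w, b = 1.0, -0.125
err = sup_error(lambda x: w * x + b, xs, zs)
print(err)
```

With these values the error is attained at the endpoints and at the midpoint of the interval, which is the typical pattern for a best uniform approximation by a line.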


3.3 The $L^2$-Error Function

Assume the input of the network is $x \in [0,1]$, and the target function $\phi : [0,1] \to \mathbb{R}$ is square integrable. If $f_{w,b}$ is the input-output mapping, the associated cost function measures the distance in the $L^2$-norm between the output and the target

$$C(w, b) = \int_0^1 \big(f_{w,b}(x) - \phi(x)\big)^2\, dx.$$

If the target function is known at only $n$ points

$$z_1 = \phi(x_1),\ z_2 = \phi(x_2),\ \ldots,\ z_n = \phi(x_n),$$

then the appropriate associated cost function is the square of the Euclidean distance in $\mathbb{R}^n$ between $f_{w,b}(\mathbf{x})$ and $\mathbf{z}$

$$C(w, b) = \sum_{i=1}^n \big(f_{w,b}(x_i) - z_i\big)^2 = \|f_{w,b}(\mathbf{x}) - \mathbf{z}\|^2, \tag{3.3.1}$$





where $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{z} = (z_1, \ldots, z_n)$. For $\mathbf{x}, \mathbf{z} \in \mathbb{R}^n$ fixed, $C(w, b)$ is a smooth function of $(w, b)$, which can be minimized using the gradient descent method.
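To make the minimization concrete, here is a minimal gradient descent sketch in Python for the sum-of-squares cost. It assumes a hypothetical linear model $f_{w,b}(x) = wx + b$ with scalar parameters, a fixed learning rate and a fixed number of steps; these choices are ours and are not prescribed by the text.

```python
def cost_and_grad(w, b, xs, zs):
    # C(w, b) = sum_i (w*x_i + b - z_i)^2 and its partial derivatives
    residuals = [w * x + b - z for x, z in zip(xs, zs)]
    cost = sum(r * r for r in residuals)
    grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs))
    grad_b = 2.0 * sum(residuals)
    return cost, grad_w, grad_b

def gradient_descent(xs, zs, lr=0.05, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        _, gw, gb = cost_and_grad(w, b, xs, zs)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Targets generated by z = 2x + 1, so the minimizer should be close to (2, 1).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
zs = [2.0 * x + 1.0 for x in xs]
w, b = gradient_descent(xs, zs)
print(w, b, cost_and_grad(w, b, xs, zs)[0])
```

The learning rate has to be small enough relative to the curvature of the cost; for this data set 0.05 is well inside the stable range.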

The geometric interpretation is given in the following. For any fixed $\mathbf{x}$, the mapping $(w, b) \to f_{w,b}(\mathbf{x})$ represents a hypersurface in $\mathbb{R}^n$ parametrized by $(w, b)$. The cost function, $C(w, b)$, represents the square of the Euclidean distance between the given point $\mathbf{z}$ and a point $f_{w,b}(\mathbf{x})$ on this hypersurface. The cost is minimized when the distance is the shortest. This is realized when the point $f_{w^*,b^*}(\mathbf{x})$ is the orthogonal projection of $\mathbf{z}$ onto the hypersurface. The optimal values of the parameters are

$$(w^*, b^*) = \arg\min_{w,b} C(w, b) = \arg\min_{w,b} \|f_{w,b}(\mathbf{x}) - \mathbf{z}\|^2 = \arg\min_{w,b}\, dist(\mathbf{z}, f_{w,b}(\mathbf{x})).$$


The fact that the vector $\mathbf{z} - f_{w^*,b^*}(\mathbf{x})$ is perpendicular to the hypersurface is equivalent to the fact that the vector is perpendicular to the tangent plane, which is generated by $\partial_{w_k} f_{w^*,b^*}(\mathbf{x})$ and $\partial_{b_j} f_{w^*,b^*}(\mathbf{x})$. Writing the vanishing inner products, we obtain the normal equations

$$\sum_{i=1}^n \big(z_i - f_{w^*,b^*}(x_i)\big)\, \partial_{w_k} f_{w^*,b^*}(x_i) = 0$$

$$\sum_{i=1}^n \big(z_i - f_{w^*,b^*}(x_i)\big)\, \partial_{b_j} f_{w^*,b^*}(x_i) = 0.$$

In general, this system cannot be solved in closed form, hence the need for numerical methods, such as the gradient descent method or its variants.
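One special case in which the normal equations can be solved explicitly is a linear model. The Python sketch below assumes the hypothetical model $f_{w,b}(x) = wx + b$, for which $\partial_w f = x$ and $\partial_b f = 1$, uses the classical least-squares formulas, and then checks that the residual vector is orthogonal to both tangent directions. The data values are made up for the illustration.

```python
def fit_line(xs, zs):
    # closed-form least-squares solution for f(x) = w*x + b
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    w = sxz / sxx
    b = z_bar - w * x_bar
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
zs = [1.0, 3.0, 2.0, 5.0]
w, b = fit_line(xs, zs)
residuals = [z - (w * x + b) for x, z in zip(xs, zs)]

# Left sides of the two normal equations; both should vanish up to rounding.
eq_w = sum(r * x for r, x in zip(residuals, xs))  # uses d/dw f = x_i
eq_b = sum(residuals)                             # uses d/db f = 1
print(w, b, eq_w, eq_b)
```

For a genuine neural network the derivatives depend nonlinearly on the parameters, which is why the same system has to be attacked numerically.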


3.4 Mean Square Error Function 

Consider a neural network whose input is a random variable $X$, and its output is the random variable $Y = f_{w,b}(X)$, where $f_{w,b}$ denotes the input-output mapping of the network, which depends on the parameter $w$ and bias $b$. Assume the network is used to approximate the target random variable $Z$. The error function in this case measures the proximity between the output and target random variables $Y$ and $Z$. A good candidate is given by the expectation of their squared difference

$$C(w, b) = E[(Y - Z)^2] = E[(f_{w,b}(X) - Z)^2], \tag{3.4.2}$$

where $E$ denotes the expectation operator, see Section D.3 of the Appendix. We search for the pair $(w, b)$ which achieves the minimum of the cost function, i.e., we look for $(w^*, b^*) = \arg\min_{w,b} C(w, b)$. This is supposed to be obtained





by one of the minimization algorithms (such as the steepest descent method) 
presented in Chapter 4. 

We shall discuss next a few reasons for the popularity of the previous error function.


1. It is known that the set of all square-integrable random variables on a probability space forms a Hilbert space with the inner product given by $\langle X, Y \rangle = E[XY]$. This defines the norm $\|X\|^2 = E[X^2]$, which induces the distance $d(X, Y) = \|X - Y\|$. Hence, the cost function (3.4.2) represents the square of a distance, $C(w, b) = d(Y, Z)^2$. Minimizing the cost is equivalent to finding the parameters $(w, b)$ that minimize the distance between the output $Y$ and target $Z$. For the definitions of norm and Hilbert spaces, see Section D.3 in the Appendix.


2. It is worth emphasizing the relation between the aforementioned cost function and the conditional expectation. The neural network transforms the input random variable $X$ into an output random variable $Y = f_{w,b}(X)$, which is parameterized by $w$ and $b$. The information generated by the random variable $f_{w,b}(X)$ is the sigma-field $\mathcal{E}_{w,b} = \mathfrak{S}(f_{w,b}(X))$, i.e., the smallest sigma-field with respect to which $f_{w,b}(X)$ is measurable. All these sigma-fields generate the output information, $\mathcal{E} = \bigvee_{w,b} \mathcal{E}_{w,b}$, which is the sigma-field generated by the union $\bigcup_{w,b} \mathcal{E}_{w,b}$. (The notation $\mathcal{E}$ comes from exit.)

In general, the target random variable, $Z$, is not determined by the information $\mathcal{E}$. The problem can now be stated as follows: Given the output information $\mathcal{E}$, find the best prediction of $Z$ based on the information $\mathcal{E}$. This is a random variable, denoted by $Y = E[Z|\mathcal{E}]$, called the conditional expectation of $Z$ given $\mathcal{E}$. The best predictor, $Y$, is determined by $\mathcal{E}$ (i.e., it is $\mathcal{E}$-measurable) and is situated at the smallest possible distance from $Z$, i.e., we have (with the notations of part 1)

$$d(Y, Z) \le d(U, Z)$$

for any $\mathcal{E}$-measurable random variable $U$. This is equivalent to

$$E[(Y - Z)^2] \le E[(U - Z)^2], \tag{3.4.3}$$

which means that

$$Y = \arg\min_U E[(U - Z)^2].$$

If $U = f_{w,b}(X)$, then the right side of (3.4.3) is the cost function $C(w, b)$. Then $Y = f_{w^*,b^*}(X)$ minimizes the cost function, and hence

$$(w^*, b^*) = \arg\min_{w,b} E[(f_{w,b}(X) - Z)^2] = \arg\min_{w,b} C(w, b).$$









Figure 3.1: The best predictor of the target $Z$, given the information $\mathcal{E}$, is the orthogonal projection of $Z$ on the space of $\mathcal{E}$-measurable functions.


We note that $Y$ is the orthogonal projection of $Z$ onto the space of $\mathcal{E}$-measurable functions, see Fig. 3.1. The orthogonality is considered in the sense introduced in the previous part 1, and $Z$ is considered to belong to the Hilbert space of square integrable functions.

Another intuitive way to present the problem is the following. Assume that the humans' goal is the ultimate knowledge of the world. The inputs to our brain are provided by the stimuli from our surroundings that our senses are able to collect and evaluate. This information is interpreted by the brain's neural network and an output information is obtained, which represents our knowledge about the world. This is a subset of the total information, which we might never be able to comprehend. However, given the limited information we are able to acquire, what is the best picture of the real world one can obtain?

3. The definition of the cost function can be easily extended to the case when the random variables are known by measurements. Consider $n$ measurements of the random variables $(X, Z)$, which are given by $(x_1, z_1), (x_2, z_2), \ldots, (x_n, z_n)$. This forms the so-called training set for the neural network. Then in this case the cost function is defined by the following average

$$C(w, b) = \frac{1}{n} \sum_{j=1}^n \big(f_{w,b}(x_j) - z_j\big)^2,$$

which can be considered as the empirical mean of the square difference of $Y$ and $Z$.











For the sake of completeness, we shall discuss also the case when the input variable $X$ and the target variable $Z$ are independent. In this case the neural network is trained on pairs of independent variables. What is the best estimator, $Y$, in this case?

Since $X$ and $Z$ are independent, then also $f_\theta(X)$ and $Z$ will be independent for all values of the parameter $\theta = (w, b)$. Then the output information, $\mathcal{E}$, which is generated by the random variables $f_\theta(X)$, will be independent of $Z$. The best estimator is

$$Y = E[Z|\mathcal{E}] = E[Z],$$

where we used that an independent condition drops out from the conditional expectation. The best estimator is a number, which is the mean of the target variable. This will be expressed more vividly in the following.

Let’s say that an experienced witch pretends to read the future of a 
gullible customer using the coffee traces on her mug. Then the coffee traces 
represent the input X and the customer future is the random variable Z. 
The witch-predicted future is the best estimator, Y, which is the mean of all 
possible futures, and does not depend on the coffee traces. Hence, the witch 
is an impostor. 
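The empirical mean square error, and the fact that the best input-independent prediction is the mean of the targets, can be checked on a small data set. The Python sketch below is an illustration with made-up numbers; the names `empirical_mse` and `best_constant` are ours.

```python
def empirical_mse(predict, xs, zs):
    # (1/n) * sum_j (f(x_j) - z_j)^2
    return sum((predict(x) - z) ** 2 for x, z in zip(xs, zs)) / len(xs)

xs = [0.3, 1.7, 2.2, 5.0]   # inputs; a constant predictor ignores them
zs = [1.0, 2.0, 4.0, 7.0]   # targets

z_mean = sum(zs) / len(zs)
cost_at_mean = empirical_mse(lambda x: z_mean, xs, zs)

# Scan constant predictions c on a grid and keep the one with the smallest cost.
candidates = [k / 10 for k in range(0, 81)]
best_constant = min(candidates, key=lambda c: empirical_mse(lambda x: c, xs, zs))
print(z_mean, best_constant, cost_at_mean)
```

The grid search lands on the sample mean of the targets, the empirical counterpart of $E[Z]$, and the corresponding cost is the empirical variance of the targets.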

The next few cost functions measure the proximity between probability densities of random variables.


3.5 Cross-entropy 


We shall present the definitions and properties of the cross-entropy in the one-dimensional case. Later, we shall use them in the 2-dimensional case. For more details on this topic the reader can consult [81] and [22].

Let $p$ and $q$ be two densities on $\mathbb{R}$ (or any other interval). It is known that the negative log-likelihood function, $-\ell_q(x) = -\ln q(x)$, measures the information given by $q(x)$.¹ The cross-entropy of $p$ with respect to $q$ is defined by the expectation with respect to $p$ of the negative log-likelihood function as

$$S(p, q) = E_p[-\ell_q] = -\int_{\mathbb{R}} p(x) \ln q(x)\, dx.$$

This represents the information given by $q(x)$ assessed from the point of view of the distribution $p(x)$, which is supposed to be given. It is obvious that for discrete densities we have $S(p, q) \ge 0$. Actually, a sharper inequality holds:


¹ This is compatible with the following properties: (i) non-negativity: $-\ell_q(x) \ge 0$; (ii) we have $-\ell_{q_1 q_2}(x) = -\ell_{q_1}(x) - \ell_{q_2}(x)$ for any two independent distributions $q_1$ and $q_2$; (iii) the information increases for rare events, with $\lim_{q(x) \to 0} (-\ell_q(x)) = \infty$.






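Proposition 3.5.1 We have $S(p, q) \ge H(p)$, where $H(p)$ is the Shannon entropy, see [111]

$$H(p) = -E_p[\ell_p] = -\int_{\mathbb{R}} p(x) \ln p(x)\, dx.$$

Proof: Evaluate the difference, using the inequality $\ln u \le u - 1$ for $u > 0$:

$$S(p, q) - H(p) = -\int_{\mathbb{R}} p(x) \ln q(x)\, dx + \int_{\mathbb{R}} p(x) \ln p(x)\, dx = -\int_{\mathbb{R}} p(x) \ln \frac{q(x)}{p(x)}\, dx \ge -\int_{\mathbb{R}} p(x) \Big(\frac{q(x)}{p(x)} - 1\Big) dx = -\Big(\int_{\mathbb{R}} q(x)\, dx - \int_{\mathbb{R}} p(x)\, dx\Big) = -(1 - 1) = 0.$$

Hence, $S(p, q) - H(p) \ge 0$, with $S(p, q) = H(p)$ if and only if $p = q$. ■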

The previous result states that for a given density $p$, the minimum of the cross-entropy $S(p, q)$ occurs for $q = p$; this minimum is equal to the Shannon entropy of the given density $p$. It is worth noting that in the case of a continuous density the entropy $H(p)$ can be finite (positive or negative) or infinite, while in the case of a discrete distribution on a finite set it is always finite and nonnegative (and hence the cross-entropy is nonnegative as well).

The Shannon entropy $H(p)$ is a measure of the information contained within the probability distribution $p(x)$. Therefore, the previous proposition states that the information given by the distribution $q(x)$ assessed from the point of view of the distribution $p(x)$ is always at least as large as the information defined by the distribution $p(x)$.
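For discrete distributions the inequality $S(p, q) \ge H(p)$ can be verified directly. The following Python sketch uses the discrete analogues of the integrals (sums over a finite sample space); the two probability vectors are arbitrary examples, and $q$ is assumed to be positive wherever $p$ is.

```python
import math

def entropy(p):
    # H(p) = -sum_i p_i ln p_i, with the convention 0 ln 0 = 0
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # S(p, q) = -sum_i p_i ln q_i; requires q_i > 0 whenever p_i > 0
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(entropy(p), cross_entropy(p, q), cross_entropy(p, p))
```

The cross-entropy with the mismatched distribution exceeds the entropy, and the two coincide when the second argument equals the first.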

Corollary 3.5.2 Let $p$ be the density of a one-dimensional random variable $X$ on $\mathbb{R}$. Then

$$H(p) \le \frac{1}{2} \ln\big(2\pi e\, Var(X)\big).$$

Proof: Let $\mu = E[X]$ and $\sigma^2 = Var(X)$, and consider the normal density $q(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, which has the same mean and variance as $p(x)$. The cross-entropy of $p$ with respect to $q$ can be computed as

$$S(p, q) = -\int_{\mathbb{R}} p(x) \ln q(x)\, dx = \frac{1}{2} \ln(2\pi\sigma^2) \int_{\mathbb{R}} p(x)\, dx + \frac{1}{2\sigma^2} \int_{\mathbb{R}} p(x)(x - \mu)^2\, dx = \frac{1}{2} \ln(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2} \ln(2\pi e \sigma^2).$$

By Proposition 3.5.1 we have $H(p) \le S(p, q) = \frac{1}{2} \ln(2\pi e \sigma^2)$. The equality is reached for random variables $X$ that are normally distributed. ■
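The bound of the corollary can be compared with a few distributions whose differential entropy is known in closed form. The Python sketch below relies on the standard formulas $H = \ln(b - a)$ for the uniform density on $[a, b]$, $H = 1 - \ln\lambda$ for the exponential density with rate $\lambda$, and $H = \frac{1}{2}\ln(2\pi e \sigma^2)$ for the normal density; the specific parameter values are chosen only for illustration.

```python
import math

def gaussian_bound(variance):
    # right-hand side of the corollary: (1/2) ln(2 pi e Var(X))
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

# (name, differential entropy, variance), using standard closed forms
cases = [
    ("uniform on [0, 1]", 0.0, 1.0 / 12.0),
    ("exponential, rate 1", 1.0, 1.0),
    ("normal, variance 4", 0.5 * math.log(2.0 * math.pi * math.e * 4.0), 4.0),
]
for name, h, var in cases:
    print(name, h, gaussian_bound(var), h <= gaussian_bound(var) + 1e-12)
```

Only the normal case attains the bound; the other two fall strictly below it.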


Remark 3.5.3 Cross-entropy is often used in classification problems, while the mean squared error is usually useful for regression problems.










3.6 Kullback-Leibler Divergence 

The difference between the cross-entropy and the Shannon entropy is the Kullback-Leibler divergence (see [69-71])

$$D_{KL}(p\|q) = S(p, q) - H(p).$$

Equivalently, this is given by

$$D_{KL}(p\|q) = -\int_{\mathbb{R}} p(x) \ln \frac{q(x)}{p(x)}\, dx.$$

By the previous result, $D_{KL}(p\|q) \ge 0$. However, this is not a distance, since it is not symmetric and it does not satisfy the triangle inequality.
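The lack of symmetry is easy to see on a two-point sample space. The Python sketch below uses the discrete version of the divergence, $\sum_i p_i \ln(p_i / q_i)$, with two arbitrarily chosen distributions; it assumes $q_i > 0$ wherever $p_i > 0$.

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i ln(p_i / q_i), with the convention 0 ln 0 = 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # about 0.51
print(kl_divergence(q, p))  # about 0.37
print(kl_divergence(p, p))  # 0.0
```

Both values are nonnegative, but swapping the arguments changes the result.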

Both the cross-entropy and the Kullback-Leibler divergence can be considered as cost functions for neural networks. This will be discussed in the following.

Let $X$ be the input random variable for a given neural network, and let $Y = f_\theta(X, \xi)$ be its output, where $\theta = (w, b)$ and $\xi$ is a random variable denoting the noise in the network. The target random variable is denoted by $Z$. Another way to match the output $Y$ to the target $Z$ is using probability density functions. The conditional density of $Y$, given the input $X$, is denoted by $p_\theta(y|x)$, and is called the conditional model density function. The joint density of $(X, Z)$ is denoted by $p_{X,Z}(x, z)$, and it is regarded as the training distribution. There are several ways to compare the densities, as will be shown next.

One way is to tune the parameter $\theta$, such that for the given training distribution $p = p_{X,Z}(x, z)$, we obtain a conditional model distribution $q = p_\theta(\cdot|x)$ for which the cross-entropy of $p$ and $q$ is as small as possible. The value

$$\theta^* = \arg\min_\theta S(p_{X,Z}, p_\theta(Z|X))$$

is the minimizer of the cost function

$$C(\theta) = S(p_{X,Z}, p_\theta(Z|X)).$$

In the best-case scenario, by virtue of Proposition 3.5.1, the aforementioned minimum equals the Shannon entropy of the training distribution, $H(p_{X,Z})$.
Let $p_X(x)$ be the density of the input variable $X$. Using the properties of conditional densities, the previous cross-entropy can be written equivalently





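as

$$\begin{aligned}
C(\theta) &= S(p_{X,Z}, p_\theta(Z|X)) = -\iint p_{X,Z}(x, z) \ln p_\theta(z|x)\, dx\, dz \\
&= -\iint p_{X,Z}(x, z) \ln\Big(\frac{p_\theta(x, z)}{p_X(x)}\Big)\, dx\, dz \\
&= -\iint p_{X,Z}(x, z) \ln p_\theta(x, z)\, dx\, dz + \iint p_{X,Z}(x, z) \ln p_X(x)\, dx\, dz \\
&= S(p_{X,Z}, p_\theta(X, Z)) + \int \Big(\int p_{X,Z}(x, z)\, dz\Big) \ln p_X(x)\, dx \\
&= S(p_{X,Z}, p_\theta(X, Z)) + \int p_X(x) \ln p_X(x)\, dx \\
&= S(p_{X,Z}, p_\theta(X, Z)) - H(p_X),
\end{aligned}$$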


where $H(p_X)$ is the input entropy, i.e., the Shannon entropy of the input variable $X$. Since $H(p_X)$ is independent of the model parameter $\theta$, the new cost function

$$\tilde{C}(\theta) = S(p_{X,Z}, p_\theta(X, Z)),$$

which is the cross-entropy of the training density with the model density, reaches its minimum for the same parameter $\theta$ as $C(\theta)$:

$$\theta^* = \arg\min_\theta C(\theta) = \arg\min_\theta \tilde{C}(\theta).$$


In conclusion, given a training density, and either a model density, $p_\theta(X, Y)$, or a conditional model density, $p_\theta(Y|X)$, we search for the parameter value $\theta$ for which either the cost $\tilde{C}(\theta)$ or $C(\theta)$, respectively, is minimum.

In practical applications, the random variables $(X, Z)$ are known through $n$ measurements

$$(x_1, z_1), (x_2, z_2), \ldots, (x_n, z_n).$$

In this case, we assume that the joint density of the pair $(X, Z)$ is approximated by its empirical training distribution $\hat{p}_{X,Z}(x, z)$. The new cost function is the cross-entropy between the empirical training distribution defined by the training set and the probability distribution defined by the model. Approximating the expectation with an average, we have

$$C(\theta) = S(\hat{p}_{X,Z}, p_\theta(Z|X)) = E_{\hat{p}_{X,Z}}[-\ln p_\theta(Z|X)] = -\frac{1}{n} \sum_{j=1}^n \ln p_\theta(z_j|x_j).$$

A similar measurement-based error can be defined as

$$\tilde{C}(\theta) = S(\hat{p}_{X,Z}, p_\theta(X, Z)) = E_{\hat{p}_{X,Z}}[-\ln p_\theta(X, Z)] = -\frac{1}{n} \sum_{j=1}^n \ln p_\theta(x_j, z_j).$$



There is an equivalent description to the previous one, when the cost function

$$C(\theta) = D_{KL}(p_{X,Z}\|p_\theta(X, Z))$$

is given by the Kullback-Leibler divergence of the training density with the model density. Since the Shannon entropy, $H(p_{X,Z})$, is independent of the parameter $\theta$, we have

$$\theta^* = \arg\min_\theta D_{KL}(p_{X,Z}\|p_\theta(X, Z)) = \arg\min_\theta S(p_{X,Z}, p_\theta(X, Z)).$$

In the best-case scenario, when the training and the model distributions coincide, the previous minimum is equal to zero.

In the case when a training set is provided

$$(x_1, z_1), (x_2, z_2), \ldots, (x_n, z_n),$$

the cost function is written using the empirical density, $\hat{p}_{X,Z}$, as

$$C(\theta) = E_{\hat{p}_{X,Z}}\Big[-\ln \frac{p_\theta(X, Z)}{\hat{p}_{X,Z}(X, Z)}\Big] = -\frac{1}{n} \sum_{j=1}^n \big(\ln p_\theta(x_j, z_j) - \ln \hat{p}(x_j, z_j)\big).$$

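Maximum Likelihood The minimum of the aforementioned empirical cost function

$$\tilde{C}(\theta) = E_{\hat{p}_{X,Z}}[-\ln p_\theta(X, Z)] = -\frac{1}{n} \sum_{j=1}^n \ln p_\theta(x_j, z_j),$$

which is $\theta^* = \arg\min_\theta \tilde{C}(\theta)$, has the distinguished statistical property that it is the maximum likelihood estimator of $\theta$, given $n$ independent measurements

$$(x_1, z_1), (x_2, z_2), \ldots, (x_n, z_n).$$

This follows from the next computation, which uses properties of logarithms:

$$\begin{aligned}
\theta^* &= \arg\min_\theta \tilde{C}(\theta) = \arg\max_\theta \frac{1}{n} \sum_{j=1}^n \ln p_\theta(x_j, z_j) \\
&= \arg\max_\theta \sum_{j=1}^n \ln p_\theta(x_j, z_j) = \arg\max_\theta \ln \prod_{j=1}^n p_\theta(x_j, z_j) \\
&= \arg\max_\theta \prod_{j=1}^n p_\theta(x_j, z_j) = \arg\max_\theta p_\theta(X = \mathbf{x}, Z = \mathbf{z}) \\
&= \theta_{ML}.
\end{aligned}$$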


The popularity of the empirical cross-entropy and Kullback-Leibler divergence as cost functions is due to this relationship with the maximum likelihood method. Furthermore, it is hoped that using these cost functions in a neural network will lead to cost surfaces with fewer plateaus than in the case of the sum of squares cost function, a fact that improves the network training time.
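The equivalence between minimizing the empirical cross-entropy and maximizing the likelihood can be illustrated on a toy model. The Python sketch below makes strong simplifying assumptions that are not in the text: there is no input variable, the model density is a unit-variance Gaussian with unknown mean $\theta$, the data are four made-up measurements, and the optimization is a plain grid search.

```python
import math

def gaussian_density(z, theta):
    # hypothetical model density p_theta(z): normal with mean theta, variance 1
    return math.exp(-0.5 * (z - theta) ** 2) / math.sqrt(2.0 * math.pi)

def empirical_cross_entropy(theta, zs):
    # -(1/n) * sum_j ln p_theta(z_j)
    return -sum(math.log(gaussian_density(z, theta)) for z in zs) / len(zs)

def likelihood(theta, zs):
    # product of the densities at the independent measurements
    return math.prod(gaussian_density(z, theta) for z in zs)

zs = [1.2, 0.7, 2.1, 1.6]
grid = [k / 100 for k in range(0, 301)]
theta_ce = min(grid, key=lambda t: empirical_cross_entropy(t, zs))
theta_ml = max(grid, key=lambda t: likelihood(t, zs))
print(theta_ce, theta_ml, sum(zs) / len(zs))
```

Both searches return the same grid point, which for this particular model is the sample mean of the measurements.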





3.7 Jensen-Shannon Divergence 

Another proximity measure between two probability distributions $p$ and $q$ is given by the Jensen-Shannon divergence

$$D_{JS}(p\|q) = \frac{1}{2}\Big(D_{KL}(p\|m) + D_{KL}(q\|m)\Big), \tag{3.7.4}$$

where $m = \frac{1}{2}(p + q)$ and $D_{KL}$ denotes the Kullback-Leibler divergence. This will be useful in optimizing GANs in Chapter 19.

Proposition 3.7.1 The Jensen-Shannon divergence has the following properties:

(i) $D_{JS}(p\|q) \ge 0$ (non-negative);

(ii) $D_{JS}(p\|q) = 0 \Leftrightarrow p = q$ (non-degenerate);

(iii) $D_{JS}(p\|q) = D_{JS}(q\|p)$ (symmetric).

Proof: (i) It follows from the fact that the Kullback-Leibler divergence is nonnegative, namely $D_{KL}(p\|m) \ge 0$, $D_{KL}(q\|m) \ge 0$.

(ii) If $p = q$, then $p = q = m$ and then $D_{KL}(p\|m) = D_{KL}(q\|m) = 0$, which implies $D_{JS}(p\|q) = 0$. Conversely, if $D_{JS}(p\|q) = 0$, then $D_{KL}(p\|m) + D_{KL}(q\|m) = 0$, and since the Kullback-Leibler divergence is nonnegative, it follows that $D_{KL}(p\|m) = 0$ and $D_{KL}(q\|m) = 0$. This implies $p = m$ and $q = m$. Multiplying by 2 we obtain $2p = p + q$, which is equivalent to $p = q$.

(iii) The symmetry follows from the commutativity of the addition of the right-side terms in formula (3.7.4). ■

The Jensen-Shannon divergence will be useful in Chapter 19 in the study 
of GANs. 
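The three properties can be observed on small discrete distributions. The Python sketch below implements the discrete versions of the Kullback-Leibler and Jensen-Shannon divergences; the probability vectors are arbitrary examples. Note that the mixture $m$ is positive wherever $p$ or $q$ is, so the logarithms are always well defined.

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i ln(p_i / q_i), with the convention 0 ln 0 = 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # D_JS(p || q) = (D_KL(p || m) + D_KL(q || m)) / 2, with m = (p + q) / 2
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.1, 0.8]
print(js_divergence(p, q), js_divergence(q, p))   # equal values
print(js_divergence(p, p))                        # 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))      # finite even for disjoint supports
```

Unlike the Kullback-Leibler divergence, the value stays finite when the two distributions have disjoint supports.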

3.8 Maximum Mean Discrepancy 

If $X$ is a continuous random variable with probability density $p(x)$ over the space $\mathcal{X}$, then for any vector function $\phi : \mathcal{X} \to \mathbb{R}^N$ we define the $\phi$-moment of $X$ by

$$\mu_\phi(X) = E[\phi(X)] = \int_{\mathcal{X}} \phi(x) p(x)\, dx.$$

For instance, if $\phi(x) = x$, then $\mu_\phi(X)$ is the mean, or the first moment of $X$. If $\phi(x) = (x, x^2, \ldots, x^N)^T$, then $\mu_\phi(X)$ is an $N$-dimensional vector containing the first $N$ moments of the random variable $X$.

Now, we shall consider two continuous random variables, $X$ and $Y$, on the same space $\mathcal{X}$ and having probability densities $p(x)$ and $q(y)$, respectively.




For a fixed function $\phi : \mathcal{X} \to \mathbb{R}^N$, the proximity between $p$ and $q$ can be measured by the Euclidean distance between the $\phi$-moments of $X$ and $Y$ as

$$d_{MMD}(p, q) = d_{Eu}(\mu_\phi(X), \mu_\phi(Y)) = \|\mu_\phi(X) - \mu_\phi(Y)\|_{Eu}. \tag{3.8.5}$$

The number $d_{MMD}(p, q)$ is called the maximum mean discrepancy of $p$ and $q$. Formula (3.8.5) can also be written in the more convenient integral form

$$d_{MMD}(p, q) = \Big\| \int_{\mathcal{X}} (p(u) - q(u)) \phi(u)\, du \Big\|_{Eu}.$$

We note that this is a generalization of the $L^1$-distance, which is obtained in the particular case $\phi(x) = 1$. We shall emphasize in the following the relation with kernels.

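Using that the length of any vector $v$ can be written as $\|v\|^2 = v^T v$, we have

$$\begin{aligned}
d_{MMD}(p, q)^2 &= \|E_p[\phi(X)] - E_q[\phi(Y)]\|^2_{Eu} \\
&= \big(E_p[\phi(X)] - E_q[\phi(Y)]\big)^T \big(E_p[\phi(X)] - E_q[\phi(Y)]\big) \\
&= E_p[\phi(X)^T] E_p[\phi(X)] + E_q[\phi(Y)^T] E_q[\phi(Y)] \\
&\quad - E_p[\phi(X)^T] E_q[\phi(Y)] - E_q[\phi(Y)^T] E_p[\phi(X)].
\end{aligned}$$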

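We shall compute each term using Fubini's theorem and show that the last
two are equal as in the following:

$$\begin{aligned}
\mathbb{E}_p[\phi(X)^T]\,\mathbb{E}_p[\phi(X)] &= \int \phi(x)^T p(x)\, dx \int \phi(x')\, p(x')\, dx'
 = \iint \phi(x)^T \phi(x')\, p(x)\, p(x')\, dx\, dx', \\
\mathbb{E}_q[\phi(Y)^T]\,\mathbb{E}_q[\phi(Y)] &= \iint \phi(y)^T \phi(y')\, q(y)\, q(y')\, dy\, dy', \\
\mathbb{E}_p[\phi(X)^T]\,\mathbb{E}_q[\phi(Y)] &= \int \phi(x)^T p(x)\, dx \int \phi(y)\, q(y)\, dy
 = \iint \phi(x)^T \phi(y)\, p(x)\, q(y)\, dx\, dy.
\end{aligned}$$

Since $\phi(x)^T \phi(y) = \phi(y)^T \phi(x)$, we obtain

$$\mathbb{E}_p[\phi(X)^T]\,\mathbb{E}_q[\phi(Y)] = \mathbb{E}_q[\phi(Y)^T]\,\mathbb{E}_p[\phi(X)].$$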


Substituting back yields

$$\begin{aligned}
d_{MMD}(p, q)^2 &= \iint \phi(x)^T \phi(x')\, p(x)\, p(x')\, dx\, dx' + \iint \phi(y)^T \phi(y')\, q(y)\, q(y')\, dy\, dy' \\
&\quad - 2 \iint \phi(x)^T \phi(y)\, p(x)\, q(y)\, dx\, dy.
\end{aligned}$$


Consider the kernel

$$K(u, v) = \phi(u)^T \phi(v) = \sum_{j=1}^N \phi_j(u)\, \phi_j(v).$$

This kernel is symmetric, $K(u, v) = K(v, u)$, and is nonnegative definite,
namely for any real-valued function $a(x)$ we have

$$\iint K(u, v)\, a(u)\, a(v)\, du\, dv \ge 0.$$

The last inequality follows from the use of Fubini's theorem and the definition
of the kernel $K(u, v)$ as

$$\iint K(u, v)\, a(u)\, a(v)\, du\, dv = \sum_{j=1}^N \iint \phi_j(u)\, a(u)\, \phi_j(v)\, a(v)\, du\, dv
 = \sum_{j=1}^N \Big( \int \phi_j(u)\, a(u)\, du \Big)^2 \ge 0.$$

Then the aforementioned expression for the maximum mean discrepancy
can be written in terms of the kernel as

$$\begin{aligned}
d_{MMD}(p, q)^2 &= \iint K(x, x')\, p(x)\, p(x')\, dx\, dx' + \iint K(y, y')\, q(y)\, q(y')\, dy\, dy'
 - 2 \iint K(x, y)\, p(x)\, q(y)\, dx\, dy \\
&= \iint K(u, v) \big( p(u) p(v) + q(u) q(v) - 2 p(u) q(v) \big)\, du\, dv \\
&= \iint K(u, v) \big( p(u) - q(u) \big) \big( p(v) - q(v) \big)\, du\, dv,
\end{aligned}$$

where we changed variables, used the symmetry of $K$, and grouped the integrals
under one integral. In the case of discrete random variables the previous
formula becomes

$$d_{MMD}(p, q)^2 = \sum_{i,j} K_{ij}\, (p_i - q_i)(p_j - q_j),$$

where $K_{ij} = K(u_i, u_j)$ is a symmetric, nonnegative definite matrix
defined by

$$K_{ij} = \sum_{k=1}^N \phi_k(u_i)\, \phi_k(u_j)$$

and $p_i = p(u_i)$, $q_j = q(u_j)$, where $\{u_1, \dots, u_n\}$ denotes the sample space.
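The equivalence between the moment form and the kernel form of the discrete maximum mean discrepancy can be checked numerically. The following is a minimal Python sketch, not taken from the text: the two-point sample space, the feature vectors, and the probability values are hypothetical and chosen only for illustration.

```python
import math

def mmd_from_moments(p, q, features):
    """Euclidean distance between the phi-moments of p and q.

    features[i] is the vector phi(u_i) attached to the i-th sample point.
    """
    dim = len(features[0])
    diff = [sum((p[i] - q[i]) * features[i][k] for i in range(len(p)))
            for k in range(dim)]
    return math.sqrt(sum(d * d for d in diff))

def mmd_squared_from_kernel(p, q, features):
    """Quadratic form sum_{i,j} K_ij (p_i - q_i)(p_j - q_j), K_ij = phi(u_i)^T phi(u_j)."""
    n = len(p)
    total = 0.0
    for i in range(n):
        for j in range(n):
            k_ij = sum(a * b for a, b in zip(features[i], features[j]))
            total += k_ij * (p[i] - q[i]) * (p[j] - q[j])
    return total

# Hypothetical example: two sample points with orthonormal feature vectors.
features = [(0.0, 1.0), (1.0, 0.0)]
p = [0.5, 0.5]
q = [0.25, 0.75]

d = mmd_from_moments(p, q, features)          # moment form
d2 = mmd_squared_from_kernel(p, q, features)  # kernel form (squared)
```

Both routes should agree up to rounding, that is, `d * d` and `d2` coincide; the kernel form never needs the feature vectors themselves once the matrix $K_{ij}$ is known.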

Maximum Mean Discrepancy For all practical purposes, the random
variable $X$ is known from a sample of $n$ observations, $x_1, \dots, x_n$, which are
drawn from the distribution $p(x)$. Similarly, the random variable $Y$ is known
from a sample of $m$ observations, $y_1, \dots, y_m$, drawn from the distribution
$q(y)$. The means are estimated as averages by

$$\mathbb{E}_{X \sim p}[\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(x_i), \qquad
\mathbb{E}_{Y \sim q}[\phi(Y)] = \frac{1}{m} \sum_{i=1}^m \phi(y_i),$$

and the maximum mean discrepancy between $p$ and $q$ can be estimated using
the previous two samples as

$$\begin{aligned}
d_{MMD}(p, q)^2 &= \Big\| \frac{1}{n} \sum_{i=1}^n \phi(x_i) - \frac{1}{m} \sum_{i=1}^m \phi(y_i) \Big\|^2 \\
&= \Big( \frac{1}{n} \sum_{i=1}^n \phi(x_i) - \frac{1}{m} \sum_{i=1}^m \phi(y_i) \Big)^T
   \Big( \frac{1}{n} \sum_{i=1}^n \phi(x_i) - \frac{1}{m} \sum_{i=1}^m \phi(y_i) \Big) \\
&= \frac{1}{n^2} \sum_{i,j} K(x_i, x_j) + \frac{1}{m^2} \sum_{i,j} K(y_i, y_j) - \frac{2}{mn} \sum_{i,j} K(x_i, y_j),
\end{aligned} \tag{3.8.6}$$

where we used the kernel notation $K(x, y) = \phi(x)^T \phi(y)$.

The maximum mean discrepancy will be used in Chapter 19 in the study 
of generative moment matching networks. 


3.9 Other Cost Functions 


Other possibilities of forming cost functions by comparing the model density,
$p_\theta(x, z)$, to the training density, $p_{X,Z}(x, z)$, are sketched in the following.
For more details, the reader is referred to [22].


$L^1$-distance Assuming the densities are integrable, the distance between
them is measured by $D_1(p_\theta, p_{X,Z}) = \iint |p_{X,Z}(x, z) - p_\theta(x, z)|\, dx\, dz$.
The minimum of $D_1$ is zero and it is reached for identical distributions.













$L^2$-distance Assuming the densities are square integrable, the distance
between them is measured by $D_2(p_\theta, p_{X,Z}) = \iint \big(p_{X,Z}(x, z) - p_\theta(x, z)\big)^2\, dx\, dz$.
For identical distributions this distance vanishes.
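As a hedged illustration of the $L^1$- and $L^2$-distances, the sketch below approximates both integrals with a midpoint rule. It is a simplification, not the text's setting: it assumes one-dimensional densities instead of joint densities in $(x, z)$, and the two unit-variance Gaussians and the integration interval are hypothetical choices.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def l1_l2_distances(p, q, a, b, steps=20000):
    """Midpoint-rule approximations of int |p - q| dx and int (p - q)^2 dx on [a, b]."""
    h = (b - a) / steps
    d1 = 0.0
    d2 = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        diff = p(x) - q(x)
        d1 += abs(diff) * h
        d2 += diff * diff * h
    return d1, d2

# Hypothetical densities: two unit-variance Gaussians with means 0 and 1.
p = lambda x: gaussian_pdf(x, 0.0, 1.0)
q = lambda x: gaussian_pdf(x, 1.0, 1.0)

d1_same, d2_same = l1_l2_distances(p, p, -10.0, 10.0)   # identical densities
d1_diff, d2_diff = l1_l2_distances(p, q, -10.0, 11.0)   # shifted densities
```

For identical densities both quantities vanish, while for the shifted pair they are strictly positive; the interval is taken wide enough that the neglected tails are negligible.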

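Hellinger distance Another variant to measure the distance is

$$H^2(p_\theta, p_{X,Z}) = 2 \iint \Big( \sqrt{p_{X,Z}(x, z)} - \sqrt{p_\theta(x, z)} \Big)^2 dx\, dz.$$

Jeffrey distance This is given by

$$J(p_\theta, p_{X,Z}) = 2 \iint \big( p_\theta(x, z) - p_{X,Z}(x, z) \big) \big( \ln p_\theta(x, z) - \ln p_{X,Z}(x, z) \big)\, dx\, dz.$$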


Renyi entropy For any $\alpha > 0$, $\alpha \ne 1$, define the Renyi entropy by

$$H_\alpha(p) = \frac{1}{1 - \alpha} \ln \int p(x)^\alpha\, dx.$$

This generalizes the Shannon entropy, which is obtained as a limit,
$H(p) = \lim_{\alpha \to 1} H_\alpha(p)$, see Exercise 3.15.9. A distinguished role is played by the
quadratic Renyi entropy, which is obtained for $\alpha = 2$:

$$H_2(p) = -\ln \int p(x)^2\, dx.$$
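The limit relation between the Renyi and Shannon entropies can be observed numerically. The sketch below is only an illustration under a stated assumption: it uses the discrete analogue of the definition (sums in place of integrals), and the probability vector is hypothetical.

```python
import math

def renyi_entropy(probs, alpha):
    """Discrete analogue of H_alpha(p) = ln(sum_i p_i^alpha) / (1 - alpha)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

def shannon_entropy(probs):
    """Discrete Shannon entropy -sum_i p_i ln p_i (terms with p_i = 0 are skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]                       # hypothetical distribution
h_shannon = shannon_entropy(probs)
h_near_one = renyi_entropy(probs, 1.0 + 1e-6)   # alpha close to 1
h_quadratic = renyi_entropy(probs, 2.0)         # equals -ln(sum_i p_i^2)
```

With $\alpha$ close to 1 the Renyi value approaches the Shannon value, and for $\alpha = 2$ the function reduces to minus the logarithm of the sum of squared probabilities.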



3.10 Sample Estimation of Cost Functions 

The practical utility of the aforementioned cost functions in machine learning
resides in the ability of approximating them from a data sample,
$\{(x_1, z_1), \dots, (x_N, z_N)\}$. In the following we shall present a few of these estimations.

Mean squared error The expectation of the squared difference of the target,
$Z$, and the network outcome, $Y = f(X; \theta)$, can be estimated as the
average

$$\mathbb{E}\big[(Z - f(X; \theta))^2\big] \approx \frac{1}{N} \sum_{j=1}^N \big(z_j - f(x_j; \theta)\big)^2.$$
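The sample average above translates directly into code. The following minimal sketch assumes a hypothetical one-parameter model $f(x; \theta) = \theta x$ and made-up data; neither comes from the text.

```python
def mean_squared_error(targets, predictions):
    """Sample average (1/N) * sum_j (z_j - f(x_j; theta))^2."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((z - y) ** 2 for z, y in zip(targets, predictions)) / len(targets)

# Hypothetical model f(x; theta) = theta * x and hypothetical data.
theta = 2.0
xs = [0.0, 1.0, 2.0, 3.0]
zs = [0.0, 2.0, 4.0, 7.0]
mse = mean_squared_error(zs, [theta * x for x in xs])
```

Only the last data point is missed by the model (by one unit), so the average of the squared residuals over the four points is 0.25.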

Quadratic Renyi entropy This estimation will use the Parzen window
method [94]. We first replace the density $p(x)$ by a sample-based density
using a window $W_\sigma$ as

$$\hat{p}(x) = \frac{1}{N} \sum_{k=1}^N W_\sigma(x, x_k).$$

For simplicity we assume the window is a one-dimensional Gaussian

$$W_\sigma(x, x_k) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{(x - x_k)^2}{2 \sigma^2}}.$$


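Consider the quadratic potential energy $U(p) = \int p(x)^2\, dx$. Since the
quadratic Renyi entropy is $H_2(p) = -\ln \int p(x)^2\, dx = -\ln U(p)$, it
suffices to estimate $U(p)$. The estimation is given by

$$\begin{aligned}
\hat{U}(\hat{p}) &= \int \hat{p}(x)^2\, dx = \int \hat{p}(x)\, \hat{p}(x)\, dx \\
&= \int \frac{1}{N} \sum_{k=1}^N W_\sigma(x, x_k)\; \frac{1}{N} \sum_{j=1}^N W_\sigma(x, x_j)\, dx \\
&= \frac{1}{N^2} \sum_{k=1}^N \sum_{j=1}^N \int W_\sigma(x, x_k)\, W_\sigma(x, x_j)\, dx.
\end{aligned}$$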


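In the case when the window is Gaussian, $W_\sigma(x, x') = \phi_\sigma(x - x')$, with
$\phi_\sigma(t) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{t^2}{2\sigma^2}}$, the previous integral can be computed
explicitly by changing the variable and transforming it into a convolution

$$\begin{aligned}
\int W_\sigma(x, x_k)\, W_\sigma(x, x_j)\, dx &= \int \phi_\sigma(x - x_k)\, \phi_\sigma(x - x_j)\, dx \\
&= \int \phi_\sigma(t)\, \phi_\sigma\big(t - (x_j - x_k)\big)\, dt \\
&= (\phi_\sigma * \phi_\sigma)(x_j - x_k) = \phi_{\sigma\sqrt{2}}(x_j - x_k) \\
&= W_{\sigma\sqrt{2}}(x_j, x_k).
\end{aligned}$$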

In the last equality we have used that the convolution of a Gaussian with
itself is a scaled Gaussian, see Exercise 3.15.10. Substituting back into the
quadratic potential energy, we arrive at the following estimation:

$$\hat{U}(\hat{p}) = \frac{1}{N^2} \sum_{k=1}^N \sum_{j=1}^N W_{\sigma\sqrt{2}}(x_j, x_k).$$

Consequently, an estimation for the quadratic Renyi entropy is given by

$$\hat{H}_2(p) = -\ln \Big( \frac{1}{N^2} \sum_{k=1}^N \sum_{j=1}^N W_{\sigma\sqrt{2}}(x_j, x_k) \Big). \tag{3.10.7}$$
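The closed-form estimator (3.10.7) can be compared against a direct numerical integration of the squared Parzen density. The sketch below is a hedged illustration only: the sample, the window width, and the integration interval are hypothetical, and the helper names are not from the text.

```python
import math

def gaussian_window(x, center, sigma):
    """One-dimensional Gaussian window W_sigma(x, center)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def quadratic_potential_estimate(sample, sigma):
    """Closed form (1/N^2) * sum_k sum_j W_{sigma*sqrt(2)}(x_j, x_k)."""
    n = len(sample)
    wide = sigma * math.sqrt(2.0)
    return sum(gaussian_window(xj, xk, wide) for xj in sample for xk in sample) / n ** 2

def quadratic_renyi_entropy_estimate(sample, sigma):
    """Estimator of H_2 as minus the log of the estimated potential."""
    return -math.log(quadratic_potential_estimate(sample, sigma))

def potential_by_quadrature(sample, sigma, a, b, steps=20000):
    """Midpoint-rule integral of the squared Parzen density, used only as a cross-check."""
    n = len(sample)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        density = sum(gaussian_window(x, xk, sigma) for xk in sample) / n
        total += density ** 2 * h
    return total

sample = [-1.0, 0.2, 0.5, 1.7]   # hypothetical observations
sigma = 0.5                      # hypothetical window width
u_closed = quadratic_potential_estimate(sample, sigma)
u_numeric = potential_by_quadrature(sample, sigma, -8.0, 9.0)
h2 = quadratic_renyi_entropy_estimate(sample, sigma)
```

The two values `u_closed` and `u_numeric` should agree to high accuracy, which illustrates that the double sum replaces the integral exactly when the window is Gaussian. The cost of the closed form is quadratic in the sample size.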










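Integrated squared error If $p_Z$ and $p_Y$ represent the target and outcome
densities, the cost function

$$C(p_Z, p_Y) = \int |p_Z(u) - p_Y(u)|^2\, du$$

can be written using the quadratic potential energy as

$$C(p_Z, p_Y) = U(p_Z) + U(p_Y) - 2 \int p_Z(u)\, p_Y(u)\, du,$$

where $U(p)$ stands for the quadratic potential energy of density $p$ defined
before. Then the estimation takes the form

$$\hat{C}(\hat{p}_Z, \hat{p}_Y) = \hat{U}(\hat{p}_Z) + \hat{U}(\hat{p}_Y) - 2 \int \hat{p}_Z(u)\, \hat{p}_Y(u)\, du.$$

The first two terms have been previously computed. It suffices to deal only
with the integral term, also called the Renyi cross-entropy. We have

$$\begin{aligned}
\int \hat{p}_Z(u)\, \hat{p}_Y(u)\, du &= \int \frac{1}{N} \sum_{j=1}^N W_\sigma(u, z_j)\; \frac{1}{N'} \sum_{k=1}^{N'} W_\sigma(u, y_k)\, du \\
&= \frac{1}{N N'} \sum_{j=1}^N \sum_{k=1}^{N'} \int W_\sigma(u, z_j)\, W_\sigma(u, y_k)\, du \\
&= \frac{1}{N N'} \sum_{j=1}^N \sum_{k=1}^{N'} W_{\sigma\sqrt{2}}(z_j, y_k) \\
&= \frac{1}{N N'} \sum_{j=1}^N \sum_{k=1}^{N'} W_{\sigma\sqrt{2}}\big(z_j, f(x_k; \theta)\big),
\end{aligned}$$

where $y = f(x; \theta)$ is the input-output mapping of the neural net. Therefore,
we obtain the following estimation

$$\begin{aligned}
\hat{C}(\hat{p}_Z, \hat{p}_Y) &= \frac{1}{N^2} \sum_{j=1}^N \sum_{k=1}^N W_{\sigma\sqrt{2}}(z_j, z_k)
 + \frac{1}{N'^2} \sum_{j=1}^{N'} \sum_{k=1}^{N'} W_{\sigma\sqrt{2}}\big(f(x_j; \theta), f(x_k; \theta)\big) \\
&\quad - \frac{2}{N N'} \sum_{j=1}^N \sum_{k=1}^{N'} W_{\sigma\sqrt{2}}\big(z_j, f(x_k; \theta)\big).
\end{aligned}$$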
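The three double sums of the integrated squared error estimate share the same structure, which suggests a single helper. The following minimal sketch is an illustration under assumptions: the target values and network outputs are hypothetical lists of numbers, and the function names are not from the text.

```python
import math

def gaussian_window(x, center, sigma):
    """One-dimensional Gaussian window W_sigma(x, center)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def mean_pairwise_window(first, second, sigma):
    """Average of W_{sigma*sqrt(2)}(a, b) over all pairs (a, b) from the two samples."""
    wide = sigma * math.sqrt(2.0)
    return sum(gaussian_window(a, b, wide) for a in first for b in second) / (len(first) * len(second))

def integrated_squared_error_estimate(targets, outputs, sigma):
    """Estimated potential of targets plus that of outputs, minus twice the cross term."""
    return (mean_pairwise_window(targets, targets, sigma)
            + mean_pairwise_window(outputs, outputs, sigma)
            - 2.0 * mean_pairwise_window(targets, outputs, sigma))

targets = [0.0, 1.0, 2.0]          # hypothetical target sample z_j
outputs_good = [0.1, 0.9, 2.1]     # hypothetical outputs close to the targets
outputs_bad = [3.0, 4.0, 5.0]      # hypothetical outputs far from the targets

ise_same = integrated_squared_error_estimate(targets, targets, 0.5)
ise_good = integrated_squared_error_estimate(targets, outputs_good, 0.5)
ise_bad = integrated_squared_error_estimate(targets, outputs_bad, 0.5)
```

The estimate vanishes when the two samples coincide and grows as the output sample moves away from the target sample, which is the behavior expected from a cost function.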


Maximum Mean Discrepancy For all practical purposes, the random
variable $X$ is known from a sample of $n$ observations, $x_1, \dots, x_n$, which are
drawn from the distribution $p(x)$. Similarly, the random variable $Y$ is known
from a sample of $m$ observations, $y_1, \dots, y_m$, drawn from the distribution
$q(y)$. The means are estimated as averages by

$$\mathbb{E}_{X \sim p}[\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(x_i), \qquad
\mathbb{E}_{Y \sim q}[\phi(Y)] = \frac{1}{m} \sum_{i=1}^m \phi(y_i),$$

and the maximum mean discrepancy between $p$ and $q$ can be estimated using
the previous two samples as

$$\begin{aligned}
d_{MMD}(p, q)^2 &= \Big\| \frac{1}{n} \sum_{i=1}^n \phi(x_i) - \frac{1}{m} \sum_{i=1}^m \phi(y_i) \Big\|^2 \\
&= \Big( \frac{1}{n} \sum_{i=1}^n \phi(x_i) - \frac{1}{m} \sum_{i=1}^m \phi(y_i) \Big)^T
   \Big( \frac{1}{n} \sum_{i=1}^n \phi(x_i) - \frac{1}{m} \sum_{i=1}^m \phi(y_i) \Big) \\
&= \frac{1}{n^2} \sum_{i,j} K(x_i, x_j) + \frac{1}{m^2} \sum_{i,j} K(y_i, y_j) - \frac{2}{mn} \sum_{i,j} K(x_i, y_j),
\end{aligned} \tag{3.10.8}$$

where we used the kernel notation $K(x, y) = \phi(x)^T \phi(y)$.
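The sample estimate of the squared maximum mean discrepancy can be computed either from the averaged feature vectors or from the kernel sums. The sketch below checks that both agree; the feature map $\phi(x) = (x, x^2)$ and the two small samples are hypothetical choices made only for illustration.

```python
def feature_map(x):
    """Hypothetical feature map phi(x) = (x, x^2), matching the first two moments."""
    return (x, x * x)

def kernel(x, y):
    """Kernel K(x, y) = phi(x)^T phi(y) induced by the feature map."""
    return sum(a * b for a, b in zip(feature_map(x), feature_map(y)))

def mmd_squared_kernel(xs, ys):
    """Kernel form: mean K(x_i, x_j) + mean K(y_i, y_j) - 2 * mean K(x_i, y_j)."""
    n, m = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / n ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / m ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

def mmd_squared_features(xs, ys):
    """Moment form: squared Euclidean distance between the averaged feature vectors."""
    dim = len(feature_map(xs[0]))
    mean_x = [sum(feature_map(x)[k] for x in xs) / len(xs) for k in range(dim)]
    mean_y = [sum(feature_map(y)[k] for y in ys) / len(ys) for k in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_x, mean_y))

xs = [0.0, 1.0, 2.0]   # hypothetical sample from p
ys = [1.0, 3.0]        # hypothetical sample from q
mmd2_kernel = mmd_squared_kernel(xs, ys)
mmd2_features = mmd_squared_features(xs, ys)
```

The kernel form is the one that generalizes to kernels whose feature map is not available explicitly, at the price of a number of kernel evaluations that is quadratic in the sample sizes.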


3.11 Cost Functions and Regularization 


In order to avoid overfitting to the training data, the cost functions may
include additional terms. It was noticed that the model overfits the training
data if the parameters are allowed to take arbitrarily large values. However, if
they are constrained to have bounded values, that would impede the model's
capability to pass through most of the training data points, and hence would
prevent overfitting. Hence, the parameter values have to be kept small, about
the zero value. In order to minimize a cost function subject to small values
of parameters, regularization terms of types $L^1$ or $L^2$ are usually used.

$L^2$-regularization Consider the weights parameter $w \in \mathbb{R}^n$. Its $L^2$-norm,
$\|w\|_2$, is given by $\|w\|_2^2 = \sum_{i=1}^n w_i^2$. The cost function with $L^2$-regularization
is obtained by adding the squared $L^2$-norm to the initial cost function

$$L_2(w) = C(w) + \lambda \|w\|_2^2,$$

where $\lambda$ is a positive Lagrange multiplier, which controls the trade-off between
the size of the weights and the minimum of $C(w)$. A large value of $\lambda$ means a
smaller value of the weights, and a larger value of $C(w)$. Similarly, a value of $\lambda$
close to zero allows for large values of the weights and a smaller value of the
cost $C(w)$, a case that is prone to overfitting. The value of the hyperparameter
$\lambda$ should be selected such that the overfitting effect is minimized.


$L^1$-regularization The $L^1$-norm of $w \in \mathbb{R}^n$ is defined by $\|w\|_1 = \sum_{i=1}^n |w_i|$.
The cost function with $L^1$-regularization becomes

$$L_1(w) = C(w) + \lambda \|w\|_1,$$

where $\lambda$ is a Lagrange multiplier, with $\lambda > 0$, which controls the strength
of our preference for small weights. Since $\|w\|_1$ is not differentiable at zero,
the application of the usual gradient method might not work properly in this
case. This disadvantage is not present in the case of the $L^2$-regularization.
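The two penalties and their (sub)gradients can be sketched in a few lines of Python. This is an illustration under assumptions: the quadratic cost is hypothetical, and the choice of the value 0 for the $L^1$ subgradient at a zero weight is one common convention, not something prescribed by the text.

```python
def l2_penalty(w):
    """Squared L2-norm sum_i w_i^2."""
    return sum(wi * wi for wi in w)

def l1_penalty(w):
    """L1-norm sum_i |w_i|."""
    return sum(abs(wi) for wi in w)

def regularized_cost(cost, w, lam, penalty):
    """Initial cost plus lam times the chosen penalty."""
    return cost(w) + lam * penalty(w)

def l2_penalty_gradient(w):
    """Gradient of the squared L2-norm, defined everywhere."""
    return [2.0 * wi for wi in w]

def l1_penalty_subgradient(w):
    """One valid subgradient of the L1-norm; 0 is chosen at the kink w_i = 0."""
    return [(wi > 0) - (wi < 0) for wi in w]

# Hypothetical quadratic cost with unconstrained minimum at (3, -2).
cost = lambda w: (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2
w = [1.0, 0.0]
lam = 0.5

cost_l2 = regularized_cost(cost, w, lam, l2_penalty)
cost_l1 = regularized_cost(cost, w, lam, l1_penalty)
grad_l2 = l2_penalty_gradient(w)
subgrad_l1 = l1_penalty_subgradient(w)
```

At the weight vector used here the second component is zero, which is exactly where the $L^1$ penalty has no classical derivative and a subgradient has to be selected instead.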


Potential regularization This is a generalization of the previous two
regularization procedures. Consider a function $U : \mathbb{R}^n \to \mathbb{R}_+$ satisfying

(i) $U(x) = 0$ if and only if $x = 0$;

(ii) $U$ has a global minimum at $x = 0$.

In the case when $U$ is smoothly differentiable, condition (ii) is implied by the
derivative conditions $U'(0) = 0$ and $U''(0) > 0$. The potential function $U$ is a
generalization of the aforementioned $L^1$ and $L^2$ norms.


The regularized cost function is defined now as

$$G(w) = C(w) + \lambda\, U(w), \qquad \lambda > 0.$$

Choosing the optimum potential with the best regularization properties
for a certain cost function is done by verifying its performance on the test
error (see section 3.12 for the definition). The test error has to decrease
significantly when the term $U(w)$ is added to the initial cost function $C(w)$. More
clearly, if $\epsilon_{test,1}$, $\epsilon_{test,2}$ are the test errors corresponding to the cost $C(w)$ and
to the regularized cost $G(w)$, respectively, then $U$ should be chosen such that

$$\epsilon_{test,2} < \epsilon_{test,1}.$$

In this case we say that the neural network generalizes well to new unseen
data. We shall deal in more detail with this type of errors in the next section.

3.12 Training and Test Errors 

One main feature that makes machine learning different from a regular
optimization problem, which minimizes only one error, is the double optimization
problem of two types of errors, which will be discussed in this section. It is the
difference between these two errors that will determine how well a machine
learning algorithm will perform.

In a machine learning approach the available data $\{(x_i, z_i)\}$ is divided
into three parts: training set, testing set and validation set. It is assumed
that all these sets are identically distributed, being generated by a common
probability distribution; another assumption is that the aforementioned
data sets are independent of each other. Size-wise, the largest of these
is the training set $\mathcal{T}$ (about 70%), followed by the test set $\mathbf{T}$ (about 20%)
and then by the validation set $\mathcal{V}$ (about 10%).
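A split with the approximate proportions mentioned above can be sketched as follows. The function name, the seed, and the synthetic data pairs are hypothetical; shuffling before splitting is an assumption intended to make the three parts resemble draws from the same distribution.

```python
import random

def split_dataset(pairs, train_frac=0.7, test_frac=0.2, seed=0):
    """Shuffle the data and split it into training, test and validation parts.

    The validation part receives whatever remains after the first two parts.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    n_test = int(round(test_frac * len(shuffled)))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

data = [(x, 2.0 * x) for x in range(100)]   # hypothetical (x_i, z_i) pairs
train, test, validation = split_dataset(data)
```

The three parts are disjoint and together cover the whole data set, so no observation is used for two different purposes.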

The cost function evaluated on the input data given by the training set is
called training error. Similarly, the cost function evaluated on the input data
given by the test set is called test error or generalization error. For instance,
the errors given by

$$C_{\mathcal{T}}(w, b) = \frac{1}{m} \sum_{j=1}^m \big( f_{w,b}(x_j) - z_j \big)^2, \qquad (x_j, z_j) \in \mathcal{T},$$

$$C_{\mathbf{T}}(w, b) = \frac{1}{k} \sum_{j=1}^k \big( f_{w,b}(x_j) - z_j \big)^2, \qquad (x_j, z_j) \in \mathbf{T},$$

with $m = \mathrm{card}(\mathcal{T})$ and $k = \mathrm{card}(\mathbf{T})$, are the training error and the test
error, respectively, associated with the average of the sum of squares. Similar
training errors can be constructed for the other cost functions mentioned
before.


In the first stage, an optimization procedure (such as the gradient descent)
is used to minimize the training error $C_{\mathcal{T}}(w, b)$ by tuning parameters $(w, b)$.
Let's denote by $(w^*, b^*)$ their optimal values. This procedure is called training.

In the second stage, we evaluate the test error at the previous optimal
parameter values, obtaining the testing error $C_{\mathbf{T}}(w^*, b^*)$. In general, the
following inequality is expected to hold:

$$C_{\mathcal{T}}(w^*, b^*) \le C_{\mathbf{T}}(w^*, b^*).$$

The following few variants are possible:

(i) Both error values, $C_{\mathcal{T}}(w^*, b^*)$ and $C_{\mathbf{T}}(w^*, b^*)$, are small. This means that
the network generalizes well, i.e., performs well not only on the training set,
but also on new, yet unseen inputs. This would be the desired scenario of
any machine learning algorithm.

(ii) The training error $C_{\mathcal{T}}(w^*, b^*)$ is small, while the test error $C_{\mathbf{T}}(w^*, b^*)$
is still large. In this case the network does not generalize well. It actually
overfits the training set. If the training error is zero, then the network is
"memorizing" the training data into the system of weights and biases. In
this case a regularization technique needs to be applied. If even after this
the test error does not get smaller, probably the network architecture has to
be revised, one way being to decrease the number of parameters. In general,
over-parametrization of the network usually leads to overfitting.

(iii) Both error values, $C_{\mathcal{T}}(w^*, b^*)$ and $C_{\mathbf{T}}(w^*, b^*)$, are large. In this case
the neural network underfits the training data. To fix this issue, we need
to increase the capacity of the network by changing its architecture into a
network with more parameters.

Figure 3.2: Overtraining and undertraining regions for a neural network.
Since the training and testing errors are stochastic, the optimal stopping is
a random variable which is mostly contained in the dotted circle.

The validation set is used to tune hyperparameters (learning rate, network
depth, etc.) such that the smallest validation error is obtained. Finding the
optimal hyperparameters is more of an art than a science, depending
on the scientist's experience, and we shall not deal with it here.

One other important issue to discuss here is overtraining and optimal
stopping from training. By training for a long enough time (involving a large
enough number of epochs) the training error can be made, in general, as small
as possible (provided the network capacity is large enough). In the beginning
of the training period the testing error also decreases until a point, after which
it starts increasing slowly. The optimal stopping time from training is when
the testing error reaches its minimum. After that instance the gap between
the training error and testing error gets larger, which leads to an overfit,
namely the network performs very well on the training data but less well on
the testing data, see Fig. 3.2.
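The stopping rule described above, namely keeping the parameters at which the testing error is smallest, can be sketched as a monitoring loop. Everything specific in the sketch is an assumption made for illustration: the linear model $f_{w,b}(x) = wx + b$, the synthetic noisy data, the learning rate, and the number of epochs.

```python
import random

def mse(params, data):
    """Average of the squared errors of the linear model w * x + b on the data."""
    w, b = params
    return sum((w * x + b - z) ** 2 for x, z in data) / len(data)

def gradient_step(params, data, lr):
    """One full-batch gradient descent step on the mean squared error."""
    w, b = params
    gw = sum(2.0 * (w * x + b - z) * x for x, z in data) / len(data)
    gb = sum(2.0 * (w * x + b - z) for x, z in data) / len(data)
    return (w - lr * gw, b - lr * gb)

def train_with_monitoring(train, test, epochs=200, lr=0.1):
    """Train on `train`, record both errors per epoch, remember the best test epoch."""
    params = (0.0, 0.0)
    train_errors, test_errors = [], []
    best_error, best_epoch, best_params = float("inf"), -1, params
    for epoch in range(epochs):
        params = gradient_step(params, train, lr)
        train_errors.append(mse(params, train))
        test_errors.append(mse(params, test))
        if test_errors[-1] < best_error:
            best_error, best_epoch, best_params = test_errors[-1], epoch, params
    return train_errors, test_errors, best_epoch, best_params

# Hypothetical noisy data generated from z = 2x + 1.
rng = random.Random(1)
points = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in (rng.random() for _ in range(40))]
train, test = points[:30], points[30:]
train_errors, test_errors, best_epoch, best_params = train_with_monitoring(train, test)
```

With a small model such as this one the gap between the two errors stays modest; the same loop applies unchanged to larger models, where the recorded `best_epoch` typically occurs well before the last epoch.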

In fact, the training and testing errors are not deterministic in practice.
The use of the stochastic gradient descent method (or its variants) assures
that errors are stochastic. If $(w_t, b_t)$ are the parameter values at the time
step $t$, then we consider the stochastic process $C_t = C_{\mathcal{T}}(w_t, b_t)$, which is the
sequence of the training errors as a function of time steps. It has been
determined experimentally that this process has a decreasing trend, approaching
zero, with a nonvanishing variance, see Fig. 3.2. One simple way to model
this error is to assume the recurrence² of first order $C_{t+1} = \phi C_t + \alpha \epsilon_t$, with
$0 < \phi < 1$, and $\alpha \epsilon_t$ a white noise term controlled by the parameter $\alpha$. This
says that the value of the error at step $t + 1$ is obtained from the value of the
error at step $t$ by shrinking it by a factor of $\phi$, and then adding some noise.

The process can be transformed into a difference model as
$C_{t+1} - C_t = -(1 - \phi)\, C_t\, dt + \alpha \epsilon_t$. Considering the time step small, this becomes a stochastic
differential equation, $dC_t = -r C_t\, dt + \alpha\, dW_t$, where $r = 1 - \phi > 0$ and the white
noise $\epsilon_t$ was replaced by increments of a Brownian motion, $dW_t$, which are
normally distributed, see section D.8 in the Appendix. The solution to this
equation is the Ornstein-Uhlenbeck process

$$C_t = C_0\, e^{-rt} + \alpha \int_0^t e^{-r(t - u)}\, dW_u.$$

This is the sum of two terms. The first is the mean of the process, which
shows an exponentially decreasing trend. The second is a Wiener integral,³
which is a random variable normally distributed, with mean zero and
variance $\frac{\alpha^2}{2r}\big(1 - e^{-2rt}\big)$. The model parameters $r$ and $\alpha$ depend on the network
hyperparameters, such as learning rate, batch size, number of epochs, etc.
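The first-order recurrence can be simulated directly. The sketch below is an illustration with hypothetical parameter values ($\phi = 0.9$, $\alpha = 0.1$) and a fixed random seed. It compares the long-run sample variance with the stationary variance of the discrete recurrence, $\alpha^2 / (1 - \phi^2)$, which is close to the continuous-time limit $\alpha^2 / (2r)$ when $\phi$ is near 1.

```python
import random

def simulate_ar1(c0, phi, alpha, steps, seed=0):
    """Iterate C_{t+1} = phi * C_t + alpha * eps_t with standard normal eps_t."""
    rng = random.Random(seed)
    path = [c0]
    for _ in range(steps):
        path.append(phi * path[-1] + alpha * rng.gauss(0.0, 1.0))
    return path

# Without noise the recurrence reduces to the geometric decay C_t = phi^t * C_0.
noiseless = simulate_ar1(1.0, 0.9, 0.0, 50)

# With noise the path fluctuates around zero with a nonvanishing variance.
noisy = simulate_ar1(1.0, 0.9, 0.1, 20000)
tail = noisy[1000:]                       # discard the initial transient
tail_mean = sum(tail) / len(tail)
tail_var = sum((c - tail_mean) ** 2 for c in tail) / len(tail)
stationary_var = 0.1 ** 2 / (1.0 - 0.9 ** 2)
```

Note that this simple model allows the simulated error to become negative, which a genuine cost value cannot do; the sketch only illustrates the decreasing trend and the persistent fluctuations.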

Example 3.12.1 There are several data sets on which machine learning
algorithms are usually tested for checking their efficiency. A few of these examples
are briefly described in the following.

(i) The MNIST data set of handwritten digits (from 0 to 9) consists of 60,000
training and 10,000 test examples. Each image has 28 x 28 pixels. Usually
5,000 images from the training examples are used as validation set.
Feedforward and convolutional networks can be trained on this data.


² In statistics this is called an autoregressive AR(1) model.

³ The reader can think of the integral $I_t = \int_0^t f(u)\, dW_u$ as a random variable $I_t$ obtained
as the mean square limit of the partial sums $S_n = \sum_{i=0}^{n-1} f(u_i)\big(W_{u_{i+1}} - W_{u_i}\big)$. Its
distribution is normal, given by $I_t \sim \mathcal{N}\big(0, \int_0^t f(u)^2\, du\big)$.

(ii) The CIFAR-10 data set contains 50,000 training and 10,000 test images 
and has 10 categories of images (airplanes, cars, birds, cats, deer, dogs, frogs, 
horses, ships, and trucks). Each one is a 32 x 32 color image. Usually 5,000 
images are kept for validation. State-of-the-art results on the CIFAR-10 
dataset have been achieved by convolutional deep belief networks, densely 
connected convolutional networks and others. 

(iii) The CIFAR-100 data set is similar to CIFAR-10, the only
difference being that it has 100 image categories.

(iv) The Street View House Numbers (SVHN) is a real-world image data set 
containing approximately 600,000 training images and 26,000 test images. 
6,000 examples out of the training set are used as validation set. SVHN is 
obtained from house numbers in Google Street View images. This data can 
be processed using convolutional networks, sparse autoencoders, recurrent 
convolutional networks, etc. 


3.13 Geometric Significance 

If $C(w, b)$ and $U(w)$ are both smooth, then the regularized cost function
$G(w, b) = C(w, b) + \lambda U(w)$ is also smooth, and its minimum is realized for

$$(w^*, b^*) = \arg\min \big( C(w, b) + \lambda U(w) \big).$$

At this minimum point the vanishing gradient condition is satisfied,
$\nabla G(w^*, b^*) = 0$, which easily implies $\nabla_w C(w^*, b^*) = -\lambda \nabla_w U(w^*)$. This
means that the normal vectors to the level surfaces of $C(w, b)$ and $U(w)$
are collinear. This occurs when the level surfaces are tangent, see Fig. 3.3.
Since the normal vectors to the previous level surfaces are collinear and of
opposite directions, it follows that $\lambda > 0$.


The significance of $\lambda$ In order to understand better the role of the multiplier
$\lambda$, we shall assume for simplicity that $U(w) = \|w\|_2^2$. Let $w^*$ be a contact point
between the level curves $\{C(w, b) = k\}$ and $\{U(w) = c\}$, see Fig. 3.3. Then
the equation $\nabla_w C(w^*, b^*) = -\lambda \nabla_w U(w^*)$ becomes $\nabla_w C(w^*, b^*) = -2\lambda w^*$.
Since $\|w^*\|_2 = \sqrt{c}$, this implies

$$\|\nabla_w C(w^*, b^*)\| = 2 \lambda \|w^*\| = 2 \lambda \sqrt{c},$$

that is, the magnitude of the normal vector to the level surface $\{C(w, b) = k\}$
at $w^*$ depends on $\lambda$ and $c$. The following remarks follow:

(i) Assume $\nabla_w C(w^*, b^*) \ne 0$. Then small (large) values of $c$ correspond
to large (small) values of $\lambda$ and vice versa. Equivalently, small values of $\lambda$
correspond to large values of weights $w$, and large values of $\lambda$ correspond to
small values of $w$.






Figure 3.3: The cost function $C(w, b)$ has a minimum at point $M$; its level
surfaces are given by $S_k = \{C(w, b) = k\}$. One of these level surfaces is
tangent to the surface $\{U(w) = c\}$ at a point where the gradients $\nabla C$ and
$\nabla U$ are collinear.

(ii) Assume $\nabla_w C(w^*, b^*) = 0$. Then either $\lambda = 0$, or $c = 0$. The condition
$c = 0$ is equivalent to $w = 0$, which implies that $w^* = 0$, namely the cost
function $C(w, b)$ has a (global) minimum at $w = 0$. Since most cost functions
do not have this property, we conclude that in this case $\lambda = 0$.
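The relation between the gradient of the cost and the multiplier can be verified on a case that is solvable by hand. The sketch below assumes a hypothetical quadratic cost $C(w) = \|w - a\|^2$ together with $U(w) = \|w\|^2$, for which the minimizer of $C + \lambda U$ is $a / (1 + \lambda)$; the vector $a$ and the values of $\lambda$ are illustrative only.

```python
import math

def regularized_minimizer(a, lam):
    """Minimizer of ||w - a||^2 + lam * ||w||^2, namely a / (1 + lam)."""
    return [ai / (1.0 + lam) for ai in a]

def cost_gradient(w, a):
    """Gradient of the hypothetical cost C(w) = ||w - a||^2."""
    return [2.0 * (wi - ai) for wi, ai in zip(w, a)]

a = [3.0, 4.0]   # hypothetical unconstrained minimizer of C
rows = []
for lam in (0.5, 1.0, 4.0):
    w_star = regularized_minimizer(a, lam)
    c = sum(wi * wi for wi in w_star)                         # level U(w*) = c
    grad_norm = math.sqrt(sum(g * g for g in cost_gradient(w_star, a)))
    rows.append((lam, c, grad_norm))
```

Each row should satisfy `grad_norm == 2 * lam * sqrt(c)` up to rounding, and the level `c` decreases as `lam` grows, in agreement with remark (i).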

The role of the constraint Let $w^* = \arg\min_{U(w) \le c} C(w, b)$ and assume $w^* \ne 0$.
There are two cases:

(a) $w^* \in \{w;\, U(w) < c\}$. This corresponds to the case when $c$ is large
enough such that the minimum of $C(w, b)$ is contained in the interior of
the level surface $\{U(w) = c\}$, see Fig. 3.4 a. In this case we choose $\lambda = 0$.

(b) $w^* \notin \{w;\, U(w) < c\}$. This describes the case when $c$ is small
enough such that the minimum of $C(w, b)$ is outside of the closed domain
$D_c = \{U(w) \le c\}$. In this case the minimum of $C(w, b)$ over the domain
$D_c$ is realized on the domain boundary, along the level surface $\{U(w) = c\}$,
at a point $w^{**}$ through which passes a level curve of $C(w, b)$ tangent to
$\{U(w) = c\}$, see Fig. 3.4 b. The unknowns, $w^{**} = (w_1, \dots, w_n)^T$ and $\lambda$,
satisfy the equations

$$\partial_{w_k} C(w^{**}) = -\lambda\, \partial_{w_k} U(w^{**}), \qquad U(w^{**}) = c.$$

In order to obtain an approximation for the solution of the previous equations,
we minimize $G(w, b) = C(w, b) + \lambda U(w)$ using the gradient descent method.
The hyperparameter $\lambda$ is tuned such that the minimum of $G(w)$ is the lowest.
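Both cases can be made explicit on a problem with a closed-form solution. The sketch below assumes the hypothetical cost $C(w) = \|w - a\|^2$ and the potential $U(w) = \|w\|^2$: if the unconstrained minimizer $a$ already satisfies $U(a) \le c$, then $\lambda = 0$; otherwise the constrained minimizer is the rescaling of $a$ onto the sphere $U(w) = c$, with $\lambda = \|a\| / \sqrt{c} - 1$.

```python
import math

def constrained_minimizer(a, c):
    """Minimize ||w - a||^2 subject to ||w||^2 <= c; return (w, lam)."""
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    if norm_a ** 2 <= c:
        # Case (a): the unconstrained minimum lies inside the domain.
        return list(a), 0.0
    # Case (b): the minimum is realized on the boundary ||w||^2 = c.
    scale = math.sqrt(c) / norm_a
    return [scale * ai for ai in a], norm_a / math.sqrt(c) - 1.0

a = [3.0, 4.0]                                   # hypothetical minimizer of C
w_in, lam_in = constrained_minimizer(a, 100.0)   # large c, case (a)
w_out, lam_out = constrained_minimizer(a, 1.0)   # small c, case (b)

# Stationarity residual dC/dw_k + lam * dU/dw_k at the boundary solution.
residual = [2.0 * (wk - ak) + lam_out * 2.0 * wk for wk, ak in zip(w_out, a)]
```

In case (b) the residual of the equations $\partial_{w_k} C = -\lambda\, \partial_{w_k} U$ should vanish up to rounding, and the solution satisfies $U(w) = c$.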

Figure 3.4: The cost function $C(w, b)$ in two distinguished possibilities:
a. $w^* \in \{w;\, U(w) < c\}$; b. $w^* \notin \{w;\, U(w) < c\}$.

Overfitting removal We have seen that in the case when the training error
$C_{\mathcal{T}}(w, b)$ is small while the test error $C_{\mathbf{T}}(w, b)$ is large, we deal with an overfit.
One way to try to fix the situation is to require the weight vector $w$ to be
small. In this case both test and training errors are close to each other. The
argument follows from the fact that if we let $w \to 0$, then the output variable
is $Y = f_{w,b}(X) \approx f_{0,b}(X) = \phi(b^{(L)})$, where $\phi$ is the activation function and
$b^{(L)}$ is the bias vector of the neurons on the last layer. We note that $\phi(b^{(L)})$
is independent of the input $X$, whether it is of test or of training type. If
the cost function is given by the cross-entropy between the densities of the
model output $Y$ and target $Z$, then the test and the training errors have
approximately the same value because

$$C_{\mathcal{T}}(0, b) = S_{\mathcal{T}}(p_Y, p_Z) \approx S_{\mathbf{T}}(p_Y, p_Z) = C_{\mathbf{T}}(0, b).$$

We have also used that $(X, Z)$ for both test and training samples are drawn
from the same underlying distribution. We can formalize this mathematically
by minimizing the cost function $C(w, b)$ subject to the constraint $U(w) \le c$,
with $c > 0$ small enough, as explained in the previous sections.

3.14 Summary 

During the learning process, neural networks try to match their outputs to
given targets. This approximation procedure involves the use of cost functions,
which measure the difference between what is predicted (output) and what
is desired (target). These proximities are of several types, depending on
what the network tries to learn: a function, a random variable, a probability
density, etc. Some of these cost functions are real distance functions (i.e.,
they are nonnegative, symmetric, and satisfy the triangle inequality), while
others are not, such as the KL-divergence or the cross-entropy. However, they
all measure the departure of the output from the target by a nondegenerate
nonnegative function.

Sometimes, for overfitting removal purposes, the cost functions are
augmented with regularization terms. These terms depend on the weights and
have the task of minimizing the cost function subject to weights of small
magnitude. The coefficient of the regularization term is a hyperparameter,
which controls the trade-off between the size of the weights and the minimum
of the cost function. The value of this hyperparameter is obtained using a
validation set. This set is independent of the training and test sets.

The available data is split into three parts that are used for the following
purposes: training, testing, and validation. When the cost function is
computed using data from the training set and gets optimized, the training error
is obtained. When testing data is used, the test error is obtained. A small
training error and a large test error is a sign of overfitting. A large training
error signals an underfit. The purpose of regularization is to obtain a lower
test error.

3.15 Exercises 

Exercise 3.15.1 Let $p, p_i, q, q_i$ be density functions on $\mathbb{R}$ and $\alpha \in \mathbb{R}$. Show
that the cross-entropy satisfies the following properties:

(a) $S(p_1 + p_2, q) = S(p_1, q) + S(p_2, q)$;

(b) $S(\alpha p, q) = \alpha S(p, q) = S(p, q^\alpha)$;

(c) $S(p, q_1 q_2) = S(p, q_1) + S(p, q_2)$.

Exercise 3.15.2 Show that the cross-entropy satisfies the following
inequality:

$$S(p, q) \ge 1 - \int p(x)\, q(x)\, dx.$$

Exercise 3.15.3 Let $p$ be a fixed density. Show that the symmetric relative
entropy

$$D_{KL}(p\|q) + D_{KL}(q\|p)$$

reaches its minimum for $p = q$, and the minimum is equal to zero.
Exercise 3.15.4 Consider two exponential densities, $p_1(x) = \xi_1 e^{-\xi_1 x}$ and
$p_2(x) = \xi_2 e^{-\xi_2 x}$, $x \ge 0$.

(a) Show that $D_{KL}(p_1\|p_2) = \frac{\xi_2}{\xi_1} - \ln \frac{\xi_2}{\xi_1} - 1$.

(b) Verify the nonsymmetry relation $D_{KL}(p_1\|p_2) \ne D_{KL}(p_2\|p_1)$.

(c) Show that the triangle inequality for $D_{KL}$ does not hold for three arbitrary
densities.

Exercise 3.15.5 Let $X$ be a discrete random variable. Show the inequality
$H(X) \ge 0$.

Exercise 3.15.6 Prove that if $p$ and $q$ are the densities of two discrete
random variables, then $D_{KL}(p\|q) \le S(p, q)$.

Exercise 3.15.7 We assume the target variable $Z$ is $\mathcal{E}$-measurable. What is
the mean squared error function value in this case?

Exercise 3.15.8 Assume that a neural network has an input-output function
$f_{w,b}$ linear in $w$ and $b$. Show that the cost function (3.3.1) reaches its minimum
for a unique pair of parameters, $(w^*, b^*)$, which can be computed explicitly.

Exercise 3.15.9 Show that the Shannon entropy can be retrieved from the
Renyi entropy as

$$H(p) = \lim_{\alpha \to 1} H_\alpha(p).$$

Exercise 3.15.10 Let $\phi_\sigma(t) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{t^2}{2\sigma^2}}$. Consider the convolution
operation $(f * g)(u) = \int f(t)\, g(t - u)\, dt$.

(a) Prove that $\phi_\sigma * \phi_\sigma = \phi_{\sigma\sqrt{2}}$;

(b) Find $\phi_\sigma * \phi_{\sigma'}$ in the case $\sigma \ne \sigma'$.

Exercise 3.15.11 Consider two probability densities, $p(x)$ and $q(x)$. The
Cauchy-Schwartz divergence is defined by

$$D_{CS}(p, q) = -\ln \left( \frac{\int p(x)\, q(x)\, dx}{\sqrt{\int p(x)^2\, dx \int q(x)^2\, dx}} \right).$$

Show the following:

(a) $D_{CS}(p, q) = 0$ if and only if $p = q$;

(b) $D_{CS}(p, q) \ge 0$;

(c) $D_{CS}(p, q) = D_{CS}(q, p)$;

(d) $D_{CS}(p, q) = -\ln \int pq - \frac{1}{2} H_2(p) - \frac{1}{2} H_2(q)$, where $H_2(\cdot)$ denotes the
quadratic Renyi entropy.






Exercise 3.15.12 (a) Show that for any function $f \in L^1[0, 1]$ we have the
inequality $\|\tanh f\|_1 \le \|f\|_1$.

(b) Show that for any function $f \in L^2[0, 1]$ we have the inequality
$\|\tanh f\|_2 \le \|f\|_2$.

Exercise 3.15.13 Consider two distributions on the sample space
$\mathcal{X} = \{x_1, x_2\}$ given by





Consider the function $\phi : \mathcal{X} \to \mathbb{R}^2$ defined by $\phi(x_1) = (0, 1)$ and
$\phi(x_2) = (1, 0)$. Find the maximum mean discrepancy between $p$ and $q$.




Chapter 4 

Finding Minima Algorithms 


The learning process in supervised learning consists of tuning the network
parameters (weights and biases) until a certain cost function is minimized.
Since the number of parameters is quite large (it can easily reach the
thousands), a robust minimization algorithm is needed. This chapter presents a
number of minimization algorithms of different flavors, and emphasizes their
advantages and disadvantages.


4.1 General Properties of Minima 

This section reviews basic concepts regarding minima of functions having a
real variable or several real variables. These theoretically feasible techniques
are efficient in practice only if the number of variables is not too large.
However, in machine learning the number of variables is in the thousands or more,
so these classical theoretical methods for finding minima are not practical for
these applications. We include them here just for completeness, and to have
a basis on which to build the more sophisticated methods that follow.


4.1.1 Functions of a real variable 


A well-known calculus result states that any real-valued continuous function
$f : [a, b] \to \mathbb{R}$, defined on a compact interval, is bounded and achieves its
bounds within the interval $[a, b]$; then there is (at least) a value $c \in [a, b]$ such
that $f(c) = \min_{x \in [a, b]} f(x)$. This is a global minimum for $f(x)$. However,
the function might also have local minima. Furthermore, if the (local or
global) minimum value is reached inside the interval, i.e., $c \in (a, b)$, and if the
function is differentiable, then Fermat's theorem states that $f'(c) = 0$, i.e., the
derivative vanishes at that value. Geometrically, this means that the tangent
line to the graph of $y = f(x)$ at the point $(c, f(c))$ is horizontal. However, this
condition is necessary but not sufficient. If the function has the additional
property to be convex (i.e. to satisfy $f''(x) \ge 0$), the aforementioned condition
becomes also sufficient.


© Springer Nature Switzerland AG 2020

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3_4


4.1.2 Functions of several real variables 

Assume $K$ is a compact set in $\mathbb{R}^n$, i.e., it is a closed and bounded set. For
instance, $K = [a_1, b_1] \times \cdots \times [a_n, b_n]$, with $a_i, b_i \in \mathbb{R}$, or $K = \{x;\, \|x\| \le R\}$,
with $R > 0$. If $f : K \to \mathbb{R}$ is a continuous function, then there is a point
$c \in K$ such that $f(c) = \min_{x \in K} f(x)$, i.e., the function has a global minimum.
Similarly to the one-dimensional case, if the (global or local) minimum $c$
is in the interior of $K$ (i.e., if it is possible to center a ball with small
enough radius at $c$, which is contained in $K$), then the following system of
partial differential equations holds:¹

$$\frac{\partial f}{\partial x_i}(c) = 0, \qquad i = 1, \dots, n. \tag{4.1.1}$$

This can be written equivalently in the gradient notation as $\nabla f(c) = 0$,
where $\nabla f = \sum_{i=1}^n \frac{\partial f}{\partial x_i}\, e_i$ is a vector with components given by the partial
derivatives. We have denoted $e_i = (0, \dots, 1, \dots, 0)^T$, the unit vector pointing in
the $i$th Cartesian direction (we used $T$ for the transpose notation). The system
(4.1.1) is equivalent to the condition that the tangent plane at $(c, f(c))$ to
the surface $z = f(x)$ is horizontal, i.e., parallel to the $x$-hyperplane.

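The second-order Taylor approximation of $f(x)$ in a neighborhood of
$x = c$ is given by

$$\begin{aligned}
f(x) &= f(c) + \sum_i \frac{\partial f}{\partial x_i}(c)(x_i - c_i) + \frac{1}{2} \sum_{j,k} \frac{\partial^2 f}{\partial x_j \partial x_k}(c)(x_j - c_j)(x_k - c_k) + o(\|x - c\|^2) \\
&= f(c) + (x - c)^T \nabla f(c) + \frac{1}{2}(x - c)^T H_f(c)(x - c) + o(\|x - c\|^2),
\end{aligned}$$

where $H_f = \left( \frac{\partial^2 f}{\partial x_j \partial x_k} \right)_{j,k}$ is the Hessian matrix of $f$. Assume the following: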


(i) $c$ satisfies the equation $\nabla f(c) = 0$;

(ii) The Hessian is positive definite and nondegenerate, i.e.

$$v^T H_f v > 0, \qquad \forall v \in \mathbb{R}^n \setminus \{0\}.$$


1 This is sometimes called the Euler system of equations. 





















Heuristically speaking, if $H_f$ is positive definite, then for $x$ close enough
to the critical point $c$, we may neglect the remainder term $o(\|x - c\|^2)$ in the Taylor
expansion and obtain $f(x) > f(c)$, for $x \ne c$. This means that $c$ is a local
minimum for $f$. In the following we shall supply a more formal argument
supporting this statement.


Proposition 4.1.1 Let $c$ be a solution of $\nabla f(c) = 0$ and assume that $H_f$ is
positive definite in a neighborhood of $c$. Then $c$ is a local minimum of $f$.


Proof: Let $x(t)$ be an arbitrary fixed curve in the domain of $f$ with $x(0) = c$,
and consider the composite function $g(t) = f(x(t))$. Showing that $c$ is a local
minimum for $f(x)$ is equivalent to proving that $t = 0$ is a local minimum for
$g(t)$, for any curve $x(t)$ with the aforementioned properties.

Let $v = x'(0)$ be the velocity vector along the curve $x(t)$ at $t = 0$. Then
the directional derivative of $f$ in the direction $v$ is given by

$$D_v f(c) = \lim_{t \to 0} \frac{f(x(t)) - f(c)}{t} = \lim_{t \to 0} \frac{g(t) - g(0)}{t} = g'(0).$$

On the other hand, an application of the chain rule yields

$$D_v f(c) = \sum_k \frac{\partial f}{\partial x_k}(x(0))\, x_k'(t)\Big|_{t=0} = \langle \nabla f(c), v \rangle = v^T \nabla f(c) = 0.$$

From the last two relations it follows that $g'(0) = 0$. In order to show that
$t = 0$ is a local minimum for $g(t)$, it suffices to prove the inequality $g''(0) > 0$
and then apply the second derivative test to the function $g(t)$ at $t = 0$. The
desired result will follow from the positive definiteness of the Hessian $H_f$ and
an application of the formula $g''(0) = v^T H_f(c) v$, which will be proved next.

Iterating the formula $D_v f = v^T \nabla f$, we have

$$\begin{aligned}
g''(0) &= D_v^2 f(c) = D_v(D_v f)(c) = D_v(v^T \nabla f)(c) = D_v\Big(\sum_k \partial_{x_k} f \, v_k\Big)(c) \\
&= \sum_k \partial_{x_k}(D_v f)(c)\, v_k = \sum_k \partial_{x_k}\Big(\sum_j \partial_{x_j} f \, v_j\Big)(c)\, v_k \\
&= \sum_{j,k} \partial_{x_j x_k} f(c)\, v_j v_k = v^T H_f(c) v.
\end{aligned}$$

Therefore, $g'(0) = 0$ and $g''(0) > 0$, and hence $c$ is a local minimum of $f$. ■
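
The identity $g''(0) = v^T H_f(c) v$ used in the proof can be checked numerically. The following is a minimal sketch under assumptions that are not part of the text: the test function $f(x, y) = x^2 + xy + 2y^2$ with critical point $c = (0, 0)$, the straight-line curve $x(t) = c + tv$, and a central finite difference for $g''(0)$. All helper names are illustrative.

```python
# Numerical check of g''(0) = v^T H_f(c) v along the line x(t) = c + t v.
# The function f, the point c and the helper names are illustrative assumptions.

def f(x, y):
    return x * x + x * y + 2.0 * y * y   # critical point at c = (0, 0)

# Hessian of f (constant, since f is quadratic)
H = [[2.0, 1.0],
     [1.0, 4.0]]

def quad_form(H, v):
    """Return v^T H v for a 2x2 matrix H and a 2-vector v."""
    return sum(v[j] * H[j][k] * v[k] for j in range(2) for k in range(2))

def second_derivative_along(v, c=(0.0, 0.0), h=1e-3):
    """Central-difference estimate of g''(0), where g(t) = f(c + t v)."""
    def g(t):
        return f(c[0] + t * v[0], c[1] + t * v[1])
    return (g(h) - 2.0 * g(0.0) + g(-h)) / (h * h)

v = (0.6, -0.8)
print(second_derivative_along(v), quad_form(H, v))
```

Both printed numbers should agree up to rounding, and they are positive because this particular Hessian is positive definite.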





Example 4.1.2 (Positive definite Hessian in two dimensions) Consider
a twice differentiable function, $f$, with continuous derivatives on $\mathbb{R}^2$. Its
Hessian is given by the $2 \times 2$ matrix

$$H_f(x, y) = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix},$$

where we used the notation $f_{xx} = \frac{\partial^2 f}{\partial x^2}$ to denote the double partial derivative
in $x$. For any vector $u = (a, b) \in \mathbb{R}^2$ a straightforward matrix multiplication
provides the quadratic form

$$u^T H_f u = f_{xx} a^2 + 2 f_{xy} ab + f_{yy} b^2,$$

which, after completing the square, becomes

$$u^T H_f u = f_{xx}\Big(a + \frac{f_{xy}}{f_{xx}} b\Big)^2 + \frac{f_{xx} f_{yy} - f_{xy}^2}{f_{xx}}\, b^2 = f_{xx}\Big(a + \frac{f_{xy}}{f_{xx}} b\Big)^2 + \frac{\det H_f}{f_{xx}}\, b^2.$$




The following concluding remarks follow from the previous computation:

(i) If $f_{xx} > 0$ and $\det H_f > 0$, then $u^T H_f u > 0$ for any $u \in \mathbb{R}^2 \setminus \{0\}$, i.e., the
Hessian $H_f$ is positive definite.

(ii) If $f_{xx} < 0$ and $\det H_f > 0$, then $u^T H_f u < 0$ for any $u \in \mathbb{R}^2 \setminus \{0\}$, i.e., the
Hessian $H_f$ is negative definite.
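
The two remarks above translate directly into a small test on the entries of a $2 \times 2$ Hessian. The sketch below is only an illustration: the function names and the sample entries are assumptions, and the inputs are understood as the values of $f_{xx}$, $f_{xy}$, $f_{yy}$ at a single point.

```python
# Sketch of the two-dimensional definiteness test from Example 4.1.2.
# Function names and sample values are illustrative assumptions.

def classify_hessian_2x2(fxx, fxy, fyy):
    """Classify the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    det = fxx * fyy - fxy * fxy
    if fxx > 0 and det > 0:
        return "positive definite"
    if fxx < 0 and det > 0:
        return "negative definite"
    return "neither"

def quadratic_form(fxx, fxy, fyy, a, b):
    """u^T H u for u = (a, b)."""
    return fxx * a * a + 2.0 * fxy * a * b + fyy * b * b

def completed_square(fxx, fxy, fyy, a, b):
    """The same quantity after completing the square (requires fxx != 0)."""
    det = fxx * fyy - fxy * fxy
    return fxx * (a + fxy / fxx * b) ** 2 + det / fxx * b * b

print(classify_hessian_2x2(2.0, 1.0, 4.0))
print(classify_hessian_2x2(-2.0, 1.0, -4.0))
print(classify_hessian_2x2(1.0, 3.0, 1.0))
```

The helper `completed_square` is included only to confirm that the two expressions for $u^T H_f u$ agree numerically.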


Example 4.1.3 (Quadratic functions) Let $A$ be a symmetric, positive
definite, and nondegenerate $n \times n$ matrix and consider the following
real-valued quadratic function of $n$ variables

$$f(x) = x^T A x - 2 b^T x + d, \qquad x \in \mathbb{R}^n,$$

with $b \in \mathbb{R}^n$, $d \in \mathbb{R}$, where $x^T A x = \sum_{i,j} a_{ij} x_i x_j$ and $b^T x = \sum_k b_k x_k$. We
have the gradient $\nabla f(x) = 2Ax - 2b$ and the Hessian $H_f = 2A$, so the
function is convex. The solution of $\nabla f(x) = 0$, which is $c = A^{-1} b$, is a local
minimum for $f$. Since the solution is unique, it follows that it is actually
a global minimum. The invertibility of $A$ follows from the properties of the
Hessian, whose determinant is nonzero.

If the previous quadratic function $f(x)$ is defined just on a compact set
$K$, sometimes the solution $c = A^{-1} b$ might not belong to $K$, and hence we
cannot obtain all minima of $f$ by solving the associated Euler system. On
the other hand, we know that $f$ achieves a global minimum on the compact
domain $K$. Hence, this minimum must belong to the boundary of $K$, and
finding it requires a boundary search for the minimum.
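
As a concrete instance of Example 4.1.3, the following sketch solves the Euler system $2Ax - 2b = 0$ for a $2 \times 2$ case and checks that the gradient vanishes at $c = A^{-1} b$. The matrix $A$, the vector $b$, the constant $d$ and the helper names are illustrative choices, not values from the text.

```python
# Sketch for Example 4.1.3 in two variables: f(x) = x^T A x - 2 b^T x + d.
# The matrix A, the vector b and the constant d are illustrative values.

def solve_2x2(A, b):
    """Solve A c = b by Cramer's rule (A is assumed invertible)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def matvec(A, x):
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

def f(A, b, d, x):
    Ax = matvec(A, x)
    return x[0] * Ax[0] + x[1] * Ax[1] - 2.0 * (b[0] * x[0] + b[1] * x[1]) + d

def grad_f(A, b, x):
    """Gradient 2 A x - 2 b (valid for symmetric A)."""
    Ax = matvec(A, x)
    return [2.0 * (Ax[0] - b[0]), 2.0 * (Ax[1] - b[1])]

A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric and positive definite
b = [1.0, 2.0]
d = 0.5

c = solve_2x2(A, b)            # candidate minimum c = A^{-1} b
print(c, grad_f(A, b, c), f(A, b, d, c))
```

If the domain were restricted to a compact set $K$ not containing `c`, this computation would no longer locate the minimum, and a search along the boundary of $K$ would be needed, as noted above.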





Example 4.1.4 (Harmonic functions) A function $f(x)$ is called harmonic
on the domain $D \subset \mathbb{R}^n$ if $\Delta_x f = 0$, $\forall x \in D$, where $\Delta_x f = \sum_{i=1}^n \frac{\partial^2 f(x)}{\partial x_i^2}$ is the
Laplacian of $f$. The minimum (maximum) property of the Laplacian states
that a harmonic function achieves its minima (maxima) on the boundary
of the domain $D$. In other words, the extrema of a harmonic function are
always reached on the boundary of the domain.

Since any affine function $f(x) = b^T x + d$, $b \in \mathbb{R}^n$, $d \in \mathbb{R}$ is harmonic,
the minima (maxima) of $f$ are reached on the boundary of $D$. If $D$ is the
interior of a convex polygon, then the minima (maxima) are achieved at
the polygon vertices. Looking for the solution vertex is the basic idea of the
simplex algorithm.

We have seen that minima cannot always be found by solving the system
$\nabla f(x) = 0$. And even if it were possible, this analytical way of finding the
minima is not always feasible in practice due to the large number of variables
involved. In the following we shall present some robust iterative methods
used to approximate the minimum of a given function.

4.2 Gradient Descent Algorithm 

The gradient descent algorithm is a procedure for finding the minimum of a
function by navigating through the associated level sets in the direction of
maximum cost decrease. We shall first present the ingredient of level sets. The
reader who is not interested in the technical details can skip directly to section
4.2.2.

4.2.1 Level sets 

Consider the function $z = f(x)$, $x \in D \subset \mathbb{R}^n$, with $n \ge 2$ and define the set
$S_c = f^{-1}(\{c\}) = \{x \in \mathbb{R}^n ; f(x) = c\}$. Assume the function $f$ is differentiable
with nonzero gradient, $\nabla f(x) \ne 0$, $x \in S_c$. Under this condition, $S_c$ becomes
an $(n-1)$-dimensional hypersurface in $\mathbb{R}^n$. The family $\{S_c\}_c$ is called the level
hypersurfaces of the function $f$. For $n = 2$ they are known under the name of
level curves. Geometrically, the level hypersurfaces are obtained by intersecting
the graph of $z = f(x)$ with horizontal planes $\{z = c\}$, see Fig. 4.1.

Proposition 4.2.1 The gradient $\nabla f$ is normal to $S_c$.






Figure 4.1: Level sets for the function $z = f(x)$.

Figure 4.2: Local parametrization $\Phi : U \to \mathbb{R}^n$ of the hypersurface $S_c$.


































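Proof: It suffices to show the relation locally, in a coordinate chart. Consider
a local parametrization of $S_c$ given by

$$\begin{aligned}
x_1 &= \Phi_1(u_1, \dots, u_{n-1}) \\
&\;\;\vdots \\
x_n &= \Phi_n(u_1, \dots, u_{n-1}),
\end{aligned}$$

which depends on $n - 1$ local parameters $u = (u_1, \dots, u_{n-1}) \in U \subset \mathbb{R}^{n-1}$, with
$\Phi = (\Phi_1, \dots, \Phi_n) : U \subset \mathbb{R}^{n-1} \to \mathbb{R}^n$, having the Jacobian of rank $n - 1$. The
tangent vector fields to $S_c$ are given by $\left\{ \frac{\partial \Phi}{\partial u_1}, \dots, \frac{\partial \Phi}{\partial u_{n-1}} \right\}$. The rank condition
on $\Phi$ implies that they are linearly independent, and hence they span the
tangent space at each point of $S_c$. We shall show that the gradient vector
$\nabla f$ is normal to each of these tangent vectors. This follows from taking the
derivative with respect to $u_i$ in the relation

$$f(\Phi(u_1, \dots, u_{n-1})) = c, \qquad \forall u \in U,$$

and applying the chain rule

$$0 = \frac{\partial}{\partial u_i} f(\Phi(u)) = \sum_k \frac{\partial f}{\partial x_k}(\Phi(u)) \frac{\partial \Phi_k(u)}{\partial u_i} = \Big\langle \nabla f(\Phi(u)), \frac{\partial \Phi}{\partial u_i} \Big\rangle,$$

where $\langle \, , \rangle$ denotes the Euclidean inner product. This shows that the gradient
$\nabla f$ is normal to the tangent space at each point $\Phi(u)$. ■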
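
The orthogonality relation obtained in the proof can be illustrated on a level curve that admits an explicit parametrization. The sketch below assumes the function $f(x, y) = x^2 + 2y^2$ and the parametrization $\Phi(u) = (\sqrt{c}\cos u, \sqrt{c/2}\sin u)$ of $S_c$; neither is taken from the text, and the helper names are hypothetical.

```python
import math

# Sketch: for f(x, y) = x^2 + 2 y^2 the level curve S_c is an ellipse,
# parametrized here by Phi(u) = (sqrt(c) cos u, sqrt(c/2) sin u).
# The function, the parametrization and all names are illustrative assumptions.

def f(x, y):
    return x * x + 2.0 * y * y

def grad_f(x, y):
    return (2.0 * x, 4.0 * y)

def phi(u, c):
    return (math.sqrt(c) * math.cos(u), math.sqrt(c / 2.0) * math.sin(u))

def dphi_du(u, c):
    """Tangent vector dPhi/du to the level curve S_c."""
    return (-math.sqrt(c) * math.sin(u), math.sqrt(c / 2.0) * math.cos(u))

def inner_gradient_tangent(u, c):
    """Inner product <grad f(Phi(u)), dPhi/du>, expected to vanish."""
    x, y = phi(u, c)
    gx, gy = grad_f(x, y)
    tx, ty = dphi_du(u, c)
    return gx * tx + gy * ty

for u in (0.0, 0.7, 2.0, 4.5):
    print(u, f(*phi(u, 3.0)), inner_gradient_tangent(u, 3.0))
```

For every parameter value the point $\Phi(u)$ stays on the level set and the inner product is zero up to rounding.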

In equivalent notations, for each $x \in S_c$, the vector $\nabla f(x)$ is normal
to the tangent plane, $T_x S_c$, of $S_c$ at $x$, see Fig. 4.2. The orientation of the
hypersurface is chosen such that $\nabla f$ points in the outward direction. The
tangent plane, $T_x S_c$, acts as an infinitesimal separator for the points about
$x$ which are inside and outside of $S_c$, see Fig. 4.3 a.

Assume now that the function $z = f(x)$ has a (local) minimum at $x^* \in D$,
i.e., $f(x^*) < f(x)$, for all $x \in V \setminus \{x^*\}$, with $V$ a neighborhood of $x^*$ (we may
assume that $V$ is a ball centered at $x^*$). Denote by $z^* = f(x^*)$ the local
minimum value of $f$ at $x^*$. Then there is an $\epsilon > 0$ such that $S_c \subset V$ for any
$c \in [z^*, z^* + \epsilon)$, see Fig. 4.1. For $c = z^*$ the hypersurface degenerates to a
point, $S_c = \{x^*\}$. For small enough $\epsilon$ the family $\{S_c\}_{c \in [z^*, z^* + \epsilon)}$ is nested, i.e.,
if $c_1 < c_2$ then $S_{c_1} \subset \mathrm{Int}(S_{c_2})$.

The next result states the existence of curves of arbitrary initial direction,
emanating from $x^*$, which evolve normal to the family $S_c$, see Fig. 4.3 b.
Recall that a function $\phi(x)$ is called Lipschitz continuous if there is a constant
$K > 0$ such that $|\phi(x) - \phi(y)| \le K \|x - y\|$, for all $x$ and $y$ in the domain of
$\phi$. The following two existence results will use this assumption.











Figure 4.3: a. Gradient $\nabla f$ normal to $S_c$. b. The curve $\alpha_v(t)$ starting at $x^*$
with initial velocity $v$.


Lemma 4.2.2 Assume $\nabla f$ is Lipschitz continuous. For any vector $v \in \mathbb{R}^n$,
there is $\delta > 0$ and a differentiable curve $\alpha : [0, \delta) \to \mathbb{R}^n$ such that:

(i) $\alpha(0) = x^*$;

(ii) $\dot{\alpha}(0) = v$;

(iii) $\dot{\alpha}(t)$ is normal to $S_{f(\alpha(t))}$, for all $t \in [0, \delta)$.


Proof: Let $x^* = (x_1^*, \dots, x_n^*)$, $v^T = (v_1, \dots, v_n)$, and $\phi_k(x) = \frac{\partial f(x)}{\partial x_k}$. It suffices
to show the existence of the curve components. The Picard-Lindelöf theorem of
existence and uniqueness of solutions states that the nonlinear system with
initial conditions

$$\begin{aligned}
\ddot{\alpha}_k(t) &= \phi_k(\alpha(t)) \\
\alpha_k(0) &= x_k^* \\
\dot{\alpha}_k(0) &= v_k
\end{aligned}$$

can be solved locally, with the solution $\alpha_k : [0, \delta_k) \to \mathbb{R}$. Set $\delta = \min_k \delta_k$
and consider the curve $\alpha(t) = (\alpha_1(t), \dots, \alpha_n(t))$, with $0 \le t < \delta$. The curve
obviously satisfies conditions (i) and (ii). Since $\ddot{\alpha}(t)$ has the direction of
the gradient, $\nabla f$, it follows that $\dot{\alpha}(t)$ is normal to the family $S_c$, i.e.,
condition (iii) holds. ■

For future reference, when we would like to indicate also the initial
direction $v$, the curve constructed as a solution of the aforementioned ODE
system will be denoted by $\alpha_v(t)$.

It is worth noting that the curve is unique up to a reparametrization, 
i.e., the curve can change speed, while its geometric image remains the same. 




















Figure 4.4: a. There is a ball $B(x^*, \epsilon)$ included in $A_\delta$. b. The sequence $(x_k)_k$
tends to $x^*$ along the curve $x(s)$.


One distinguished parametrization is over the level difference parameter $\tau = c - z^*$.
This follows from the fact that for small values of $t$ we have $c'(t) > 0$.
If the curve in the new parametrization is denoted by $\beta(\tau)$, then $\beta(0) = x^* = \alpha(0)$
and $\beta'(\tau) = \dot{\alpha}(t(\tau))\, t'(\tau)$, so $\beta'(0) = t'(0) v$. We also have the
convenient incidence relation $\beta(\tau) = \beta(c - z^*) \in S_c$, which can be written
also as $f(\beta(\tau)) = c$.

Now we state and prove the following local connectivity result, which will be
used later in the gradient descent method:

Theorem 4.2.3 Assume $\nabla f$ is Lipschitz continuous. For any point $x^0$ close
enough to $x^*$ there is a differentiable curve $\gamma : [0, \delta] \to \mathbb{R}^n$ such that

(i) $\gamma(0) = x^0$;

(ii) $\gamma(\delta) = x^*$;

(iii) $\dot{\gamma}(s)$ is normal to $S_{f(\gamma(s))}$, for all $s \in [0, \delta)$.


We shall provide a nonconstructive proof. We first prepare the ground
with a few notations. Let $\alpha_v : [0, \delta_v) \to \mathbb{R}^n$ be the curve provided by Lemma
4.2.2. Assume the initial vector is sub-unitary, $\|v\| \le 1$. Since the end value $\delta_v$
is continuous with respect to $v$, it reaches its minimum on the unit ball
as $\delta = \min_{\|v\| \le 1} \delta_v$. Denote now by $A_\delta = \{\alpha_v(t) ;\ t \in [0, \delta),\ \forall v \in \mathbb{R}^n,\ \|v\| \le 1\}$.
The set $A_\delta$ denotes the solid domain swept by the curves $\alpha_v(t)$, which
emanate from $x^*$ in all directions $v$, until time $\delta$. The value of $\delta$ was chosen
such that the definition of $A_\delta$ makes sense. The set $A_\delta$ is nonempty, since
obviously, $x^* \in A_\delta$. The next result states that in fact $A_\delta$ contains a nonempty
ball centered at $x^*$, see Fig. 4.4 a. This is actually a result equivalent to the
statement of Theorem 4.2.3.
















Theorem 4.2.4 There is an $\epsilon > 0$ such that $B(x^*, \epsilon) \subset A_\delta$.


Proof: We shall provide first an empirical proof. The fact that $B(x^*, \epsilon) \subset A_\delta$
means that $x^*$ is an interior point of $A_\delta$. If, by contradiction, we assume
that $x^*$ is not an interior point, then there is a sequence of points $(x_k)_k$
convergent to $x^*$ such that $x_k \notin A_\delta$. Here is where the empirical assumption
is made: assume that the points $x_k$ lie on a smooth curve $x(s)$, which starts at $x(0) = x^*$
and satisfies $x(s_k) = x_k$, with $s_k$ a decreasing sequence of negative numbers,
see Fig. 4.4 b. Let $v^0 = x'(0)$ be the direction under which the curve $x(s)$
approaches the point $x^*$. Lemma 4.2.2 produces a curve $\alpha_{v^0}$ starting in
this direction, which will coincide with $x(s)$ on a neighborhood. This follows
from the fact that both curves $x(s)$ and $\alpha_{v^0}$ have the same initial points and
velocities. Since in this case we would have $x(s) = \alpha_{v^0}(-s) \in A_\delta$, for $-s < \delta$,
this leads to a contradiction. ■


We make two remarks: 

1. All points that can be joined to $x^*$ by a curve $\gamma$ satisfying the
aforementioned properties form the basin of attraction of $x^*$. The theorem can
be stated equivalently by saying that the basin of attraction contains a ball
centered at $x^*$.

2. The direction $v^0 = x'(0)$ is a degenerate direction. The previous proof
shows that there are no degenerate directions. A formal proof of this fact can
be done using the Inverse Function Theorem (see Theorem F.1 in the Appendix)
as in the following.


Proof: Consider $B_n = \{v \in \mathbb{R}^n ; \|v\| \le 1\}$ and define the function $F : B_n \to A_\delta$
by $F(v) = \alpha_v(\delta)$. The existence of degenerate directions is equivalent to
the fact that the Jacobian $\frac{\partial F}{\partial v}$ is degenerate. This follows from the Inverse
Function Theorem, which states that if the Jacobian $\frac{\partial F}{\partial v}$ is nondegenerate,
then $F$ is a local diffeomorphism, i.e., there is a neighborhood $V$ of $F(0) = x^*$
in $A_\delta$ and a $0 < \rho < 1$ such that $F|_{B(0,\rho)} : B(0, \rho) \to V$ is a diffeomorphism.$^2$
For details, see section F.1 of the Appendix. Consequently, $x^* \in F(B(0, \rho)) \subset V \subset A_\delta$.
We may choose a ball centered at $x^*$ of radius $\epsilon$ such that
$B(x^*, \epsilon) \subset F(B(0, \rho))$, which ends the proof.


So, it suffices to show that the Jacobian of $F$ is nondegenerate. Note
that $\frac{\partial F}{\partial v}$ is an $n \times n$ square matrix. The degeneracy is equivalent to the
existence of a nonzero vector $w = v - v^0$ such that

$$\frac{\partial F}{\partial v}\, w = 0.$$


$^2$ This means that $F|_{B(0,\rho)}$ is bijective with both $F$ and its inverse differentiable.















Figure 4.5: The existence of a cut-point: a. The curves $\alpha_v$ and $\alpha_{v^0}$ are
transversal. b. The curves $\alpha_v$ and $\alpha_{v^0}$ are tangent.


Using the linear approximation

$$F(v) = F(v^0) + \frac{\partial F}{\partial v}(v - v^0) + o(\|v - v^0\|^2)$$

we obtain that $F(v) = F(v^0) + o(\|v - v^0\|^2)$, i.e. we have $\alpha_v(\delta) = \alpha_{v^0}(\delta) + o(\|v - v^0\|^2)$.
Neglecting the quadratic term, we shall assume for simplicity
that $\alpha_v(\delta) = \alpha_{v^0}(\delta)$, i.e., there is a "cut-point" where two curves with distinct
initial velocities, $v$ and $v^0$, intersect again. We need to show that the flow
$\{\alpha_v\}_v$ is free of cut points on a neighborhood of $x^*$. By contradiction, assume
there are cut points in any neighborhood, $\mathcal{N}_1$, of $x^*$. Let $p$ be one of them, so
$p = \alpha_v(\delta) = \alpha_{v^0}(\delta) \in S_c$, with $c = f(p)$. Assume the curves are transversal at
the cut-point $p$, see Fig. 4.5 a. Since $\dot{\alpha}_v(\delta)$ and $\dot{\alpha}_{v^0}(\delta)$ are both normal to $S_c$,
the curves have to be tangent at $p$. Now, by a similar procedure as in Lemma
4.2.2, we have two curves intersecting at the same point, $p$, and having the
same direction, see Fig. 4.5 b, so they have to coincide on a neighborhood $\mathcal{N}_2$.
Choosing $\mathcal{N}_1 \subset \mathcal{N}_2$, it follows that the curves coincide, which is contradictory. ■


Proof of Theorem 4.2.3. The proof of Theorem 4.2.3 follows by choosing $x^0$
in the ball $B(x^*, \epsilon)$ provided by Theorem 4.2.4. ■

The previous nonconstructive proof provides the existence of a curve of
steepest descent, $\gamma$, from $x^0$ to $x^*$, which intersects normally the level sets
family $S_c$. However, this continuous result is not useful when it comes to
computer implementations. In order to implement the curve construction we
need to approximate the curve $\gamma$ by a polygonal line $\mathcal{P}_m = [x^0 x^1 \dots x^m]$
satisfying the following properties:



















(i) $x^k \in S_{c_k}$;

(ii) $c_{k+1} < c_k$, for all $k = 0, \dots, m - 1$;

(iii) the line $x^j x^{j+1}$ is normal to $S_{c_j}$.

The construction algorithm goes as follows. We start from the
point $x^0$ and go for a distance $\eta$ along the normal line at $S_{c_0}$ in the inward
direction, or equivalently, in the direction of $-\nabla f$ at the point $x^0$. Thus, we
obtain the point $x^1 \in S_{c_1}$. We continue by going again a distance $\eta$ along the
normal line at $S_{c_1}$ in the inward direction, obtaining the point $x^2$. After
$m$ steps we obtain the point $x^m$, which we hope to be in the proximity of $x^*$,
and hence a good approximation of it.


But how do we choose $m$, or in other words, how do we know when to stop
the procedure? The algorithm continues as long as $c_{k+1} < c_k$, i.e., the landing
hypersurfaces are nested. For any a priori fixed $\eta > 0$ there is a smallest $m$
with properties (i)-(iii). This means that we stop when $x^{m+1}$ lands on a
hypersurface $S_{c_{m+1}}$ that is not nested inside of the previous hypersurface
$S_{c_m}$. The smaller the step $\eta$, the larger the stopping order $m$, and the closer
to $x^*$ it is expected to get. When $\eta \to 0$, the polygonal line tends toward
the curve $\gamma$.
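
The construction and its stopping rule can be sketched in a few lines. The example below is a hedged illustration only: it assumes the objective $f(x, y) = x^2 + 2y^2$ with minimum $x^* = (0, 0)$, and the names `polygonal_descent` and `max_steps` are hypothetical. Each step has length $\eta$ along $-\nabla f$, and the loop stops as soon as the next point would not land on a lower level set.

```python
import math

# Sketch of the polygonal-line construction with fixed step length eta.
# The objective f and all names are illustrative assumptions.

def f(x, y):
    return x * x + 2.0 * y * y          # minimum at x* = (0, 0)

def grad_f(x, y):
    return (2.0 * x, 4.0 * y)

def polygonal_descent(x0, eta, max_steps=10000):
    """Return the vertices x^0, ..., x^m; stop when c_{k+1} < c_k fails."""
    path = [x0]
    x = x0
    for _ in range(max_steps):
        gx, gy = grad_f(*x)
        norm = math.hypot(gx, gy)
        if norm == 0.0:                  # already at a critical point
            break
        nxt = (x[0] - eta * gx / norm, x[1] - eta * gy / norm)
        if f(*nxt) >= f(*x):             # next level set is not nested inside
            break
        path.append(nxt)
        x = nxt
    return path

path = polygonal_descent((1.0, 1.0), eta=0.1)
print(len(path) - 1, path[-1])
```

Running the sketch with a smaller `eta` yields more vertices and, for this objective, a final vertex closer to the origin.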


If the algorithm stops after $m$ steps, the upper and lower error bounds
are given by

$$\|x^0 - x^*\| - m\eta \le \|x^m - x^*\| \le \mathrm{dia}(S_{c_m}), \tag{4.2.2}$$

where $\mathrm{dia}(S_c)$ denotes the diameter of $S_c$, i.e., the largest distance between
any two elements of $S_c$. It also makes sense to consider a bounded number of
steps with $m < \frac{\|x^0 - x^*\|}{\eta}$.

The left inequality follows from the fact that any polygonal line is longer
than the line segment joining its end points

$$\begin{aligned}
\|x^0 - x^*\| &\le \|x^0 - x^1\| + \|x^1 - x^2\| + \cdots + \|x^{m-1} - x^m\| + \|x^m - x^*\| \\
&= m\eta + \|x^m - x^*\|.
\end{aligned}$$

This becomes an identity if the polygonal line is a straight line, a case which
occurs if the hypersurfaces are hyperspheres centered at $x^*$.

The inequality on the right of (4.2.2) follows from the construction of $x^m$,
which belongs to $S_{c_m}$. Therefore, we have the estimations

$$\|x^m - x^*\| \le \max_{y \in S_{c_m}} \|x^* - y\| \le \sup_{x, y \in S_{c_m}} \|x - y\| = \mathrm{dia}(S_{c_m}).$$

The shrinking condition $S_{c_m} \to \{x^*\}$ allows for diameters $\mathrm{dia}(S_{c_m})$ as small
as possible, provided $\eta$ is small enough.


















































Figure 4.6: The polygonal line $\mathcal{P}_m$ in two cases: a. Step $\eta$ large. b. Step $\eta$ small.


We have a few comments regarding the size of $\eta$:

(i) if $\eta$ is large, the algorithm stops too early, before reaching a good
approximation of the point $x^*$, see Fig. 4.6 a;

(ii) if $\eta$ is too small, the stopping order $m$ is large and it might not be time
effective in the case of a computer implementation, see Fig. 4.6 b.

In practical applications, the size of the step $\eta$ is a trade-off between
the margin of error and time effectiveness of a running application. We shall
formalize this idea further in section 4.2.4.


4.2.2 Directional derivative 


Another concept used later is the directional derivative, which measures
the instantaneous rate of change of a function at a point in a given direction.
More precisely, let $v$ be a unit vector in $\mathbb{R}^n$ and consider the differentiable
function $f : U \subset \mathbb{R}^n \to \mathbb{R}$. The directional derivative of $f$ at the point $x^0 \in U$
is defined by

$$\frac{\partial f}{\partial v}(x^0) = \lim_{t \searrow 0} \frac{f(x^0 + tv) - f(x^0)}{t}.$$

Note that partial derivatives with respect to coordinates are directional
derivatives with respect to the coordinate vectors $v = (0, \dots, 1, \dots, 0)^T$. An
application of the chain rule provides a computation of the directional derivative
as a scalar product:
















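$$\frac{\partial f}{\partial v}(x^0) = \frac{d}{dt} f(x^0 + tv)\Big|_{t=0^+} = \sum_{k=1}^n \frac{\partial f}{\partial x_k}(x^0 + tv)\Big|_{t=0} v_k = v^T \nabla f(x^0) = \langle \nabla f(x^0), v \rangle.$$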
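
The scalar-product formula can be compared against the defining limit. The following sketch assumes the test function $f(x, y) = x^2 y + 3y$, the point $x^0 = (1, 2)$ and the unit vector $v = (0.6, 0.8)$, none of which come from the text; the one-sided difference quotient with a small $t > 0$ stands in for the limit $t \searrow 0$.

```python
# Sketch: directional derivative as a one-sided difference quotient versus
# the inner product <grad f(x0), v>. Function and names are illustrative.

def f(x, y):
    return x * x * y + 3.0 * y

def grad_f(x, y):
    return (2.0 * x * y, x * x + 3.0)

def directional_derivative(func, x0, v, t=1e-6):
    """Approximate lim_{t -> 0+} (f(x0 + t v) - f(x0)) / t with a small t > 0."""
    return (func(x0[0] + t * v[0], x0[1] + t * v[1]) - func(x0[0], x0[1])) / t

x0 = (1.0, 2.0)
v = (0.6, 0.8)                      # unit vector
g = grad_f(*x0)
inner = g[0] * v[0] + g[1] * v[1]   # <grad f(x0), v>

print(directional_derivative(f, x0, v), inner)
```

The two printed values differ only by the discretization error of the difference quotient.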


4.2.3 Method of Steepest Descent 

The method of steepest descent (or, gradient descent method) is a numerical
method based on a greedy algorithm by which a minimum of a function is
searched by directing a given step in the direction that decreases the value
of the function the most. One can picture the method by considering a
blindfolded tourist who would like to get down a hill in the fastest possible
fashion. At each point the tourist checks the vicinity to find the
direction of steepest descent and then makes one step in that direction. Then
the procedure repeats, until the tourist eventually reaches the bottom of
the valley (or gets stuck in a local minimum, if his step is too small).


[Cartoon text: The steepest descent method. A blindfolded mountaineer descends from the top
of a hill to the bottom of the valley where the cottage waits for him.
He always takes the steepest descent path, which is found locally using only his cane.]


Cartoon 1: A blindfolded man gets off a mountain using the steepest descent
method by taking advantage of the local geometry of the environment.

In order to apply this method, we are interested in finding the unit
direction $v$, in which the function $f$ decreases as much as possible within a
given small step size $\eta$. The change of the function $f$ between the value at
the initial point $x^0$ and the value after a step of size $\eta$ in the direction $v$ is














written using the linear approximation as

$$\begin{aligned}
f(x^0 + \eta v) - f(x^0) &= \sum_{k=1}^n \frac{\partial f}{\partial x_k}(x^0)\, \eta v_k + o(\eta^2) \\
&= \eta \langle \nabla f(x^0), v \rangle + o(\eta^2).
\end{aligned}$$

In the following, since $\eta$ is small, we shall neglect the effect of the quadratic
term $o(\eta^2)$. Hence, in order to obtain $v$ such that the change in the function
has the largest negative value, we shall use the Cauchy inequality for the
scalar product

$$-\|\nabla f(x^0)\| \, \|v\| \le \langle \nabla f(x^0), v \rangle \le \|\nabla f(x^0)\| \, \|v\|.$$

It is known that the inequality on the left is reached for vectors that are
negative proportional.$^3$ Since $\|v\| = 1$, the minimum occurs for

$$v = -\frac{\nabla f(x^0)}{\|\nabla f(x^0)\|}.$$

Then the largest change in the function is approximately equal to

$$f(x^0 + \eta v) - f(x^0) = \eta \langle \nabla f(x^0), v \rangle = -\eta \|\nabla f(x^0)\|.$$


The constant $\eta$ is called the learning rate. From the previous relation, the
change in the function after each step is proportional to the magnitude of
the gradient as well as to the learning rate.
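
The choice $v = -\nabla f(x^0)/\|\nabla f(x^0)\|$ can be compared with other unit directions numerically. The sketch below assumes the objective $f(x, y) = x^2 + 2y^2$, the point $x^0 = (1, 1)$ and a grid of 360 unit vectors; these choices and all names are illustrative. It checks that no sampled direction gives a smaller inner product with the gradient, and that the actual change of $f$ for a small $\eta$ is close to $-\eta\|\nabla f(x^0)\|$.

```python
import math

# Sketch: the unit vector minimizing <grad f(x0), v> is -grad f / ||grad f||.
# The objective, the sample point and the helper names are assumptions.

def f(x, y):
    return x * x + 2.0 * y * y

def grad_f(x, y):
    return (2.0 * x, 4.0 * y)

def inner(a, b):
    return a[0] * b[0] + a[1] * b[1]

def steepest_descent_direction(g):
    norm = math.hypot(g[0], g[1])
    return (-g[0] / norm, -g[1] / norm)

x0 = (1.0, 1.0)
g = grad_f(*x0)
g_norm = math.hypot(g[0], g[1])
v_star = steepest_descent_direction(g)

# Unit vectors sampled on the circle, one per degree.
candidates = [(math.cos(2.0 * math.pi * k / 360), math.sin(2.0 * math.pi * k / 360))
              for k in range(360)]
best_sampled = min(inner(g, v) for v in candidates)

eta = 1e-3
change = f(x0[0] + eta * v_star[0], x0[1] + eta * v_star[1]) - f(*x0)

print(inner(g, v_star), best_sampled, change, -eta * g_norm)
```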

The algorithm consists of the following iteration, which constructs the
sequence $(x^n)$:

(i) Choose an initial point $x^0$ in the basin of attraction of the global minimum $x^*$.

(ii) Construct the sequence $(x^n)_n$ using the iteration

$$x^{n+1} = x^n - \eta \frac{\nabla f(x^n)}{\|\nabla f(x^n)\|}. \tag{4.2.3}$$

This guarantees a negative change of the objective function, which is given
by $f(x^{n+1}) - f(x^n) = -\eta \|\nabla f(x^n)\| < 0$.


We note that the line $x^n x^{n+1}$ is normal to the level hypersurface $S_{f(x^n)}$.
Hence, we obtain the polygonal line $\mathcal{P}_m = [x^0 \dots x^m]$ from the previous
section, which is an approximation of the curve $\gamma$ provided by Theorem 4.2.3.


$^3$ This is more transparent in the case of $\mathbb{R}^3$, when $\langle \nabla f(x^0), v \rangle = \|\nabla f(x^0)\| \, \|v\| \cos \theta$.
The minimum is realized for $\theta = \pi$, i.e. when the vectors have opposite directions.



























Figure 4.7: a. The use of a fixed learning rate $\eta$ leads to missing the minimum
$x^*$. b. An adjustable learning rate $\eta_n$ provides a much better approximation
of the minimum. In this case the descent amount is proportional to the
slope of the curve at the respective points; these slopes become smaller as we
approach the critical point $x^*$.


However, this construction has a drawback, which will be fixed shortly. 


Since $\|x^{n+1} - x^n\| = \eta > 0$, the approximation sequence $(x^n)_n$ does not
converge, so it is easy to miss the minimum point $x^*$, see Fig. 4.7 a. To
overcome this problem, we shall assume that the learning rate $\eta$ is adjustable,
in the sense that it becomes smaller as the function changes slower (when the
gradient is small), see Fig. 4.7 b. We assume now there is a positive constant
$\delta > 0$ such that the learning rate in the $n$th iteration is proportional to
the gradient, $\eta_n = \delta \|\nabla f(x^n)\|$. Then the iteration (4.2.3) changes into

$$x^{n+1} = x^n - \delta \nabla f(x^n). \tag{4.2.4}$$
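
A minimal sketch of the iteration (4.2.4) is given below. It is not an implementation from the text: the stopping tolerance `tol`, the cap `max_iter`, and the test objective $f(x, y) = (x - 1)^2 + 2(y + 0.5)^2$ are assumptions made for illustration.

```python
import math

# Sketch of the iteration x^{n+1} = x^n - delta * grad f(x^n).
# The tolerance, the iteration cap and the test function are assumptions.

def gradient_descent(grad, x0, delta, tol=1e-10, max_iter=100000):
    """Iterate until the gradient norm drops below tol; return (x, steps)."""
    x = list(x0)
    for n in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return x, n
        x = [xi - delta * gi for xi, gi in zip(x, g)]
    return x, max_iter

def grad_example(x):
    # gradient of f(x, y) = (x - 1)^2 + 2 (y + 0.5)^2, minimum at (1, -0.5)
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)]

x_min, steps = gradient_descent(grad_example, [4.0, 3.0], delta=0.1)
print(x_min, steps)
```

Stopping on a small gradient norm mirrors the statement of the next proposition, which ties convergence of the sequence to the gradients tending to zero.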


Proposition 4.2.5 The sequence $(x^n)_n$ defined by (4.2.4) is convergent if
and only if the sequence of gradients converges to zero, $\nabla f(x^n) \to 0$, $n \to \infty$.


Proof: "$\Longrightarrow$" Since the sequence is convergent,

$$0 = \lim_{n \to \infty} \|x^{n+1} - x^n\| = \delta \lim_{n \to \infty} \|\nabla f(x^n)\|,$$

which leads to the desired result.


































Figure 4.8: The graph of $z = \frac{1}{2} y^2 - x$.


"$\Longleftarrow$" In order to show that $(x^n)_n$ is convergent, it suffices to prove that $x^n$
is a Cauchy sequence. Let $p \ge 1$. An application of the triangle inequality yields

$$\|x^{n+p} - x^n\| \le \|x^{n+p} - x^{n+p-1}\| + \cdots + \|x^{n+1} - x^n\| = \delta \sum_{j=0}^{p-1} \|\nabla f(x^{n+j})\|.$$

Keeping $p$ fixed, we have

$$\lim_{n \to \infty} \|x^{n+p} - x^n\| \le \delta \sum_{j=0}^{p-1} \lim_{n \to \infty} \|\nabla f(x^{n+j})\| = 0. \qquad ■$$

It is worth noting that if $f$ is continuously differentiable, since $x^n \to x^*$,
then $\nabla f(x^n) \to \nabla f(x^*)$. This agrees with the condition $\nabla f(x^*) = 0$.

Example 4.2.1 Consider the function $f : (0, 1) \times (-2, 2) \to \mathbb{R}$, given by
$f(x, y) = \frac{1}{2} y^2 - x$. Its graph has a canyon-type shape, with the minimum
at the point $(1, 0)$, see Fig. 4.8. Let $(x^0, y^0)$ be a fixed point in the function
domain. Since the gradient is given by $\nabla f(x, y)^T = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (-1, y)$, the
equation (4.2.4) writes as

$$\begin{aligned}
x^{n+1} &= x^n + \delta \\
y^{n+1} &= (1 - \delta)\, y^n.
\end{aligned}$$


































Figure 4.9: Iterations: a. Case $0 < \delta < 1$. b. Case $1 < \delta < 2$.


This iteration can be solved explicitly in terms of the initial point $(x^0, y^0)$.
We get

$$\begin{aligned}
x^n &= n\delta + x^0 \\
y^n &= (1 - \delta)^n y^0.
\end{aligned}$$

The sequence $(y^n)_n$ converges for $|1 - \delta| < 1$, i.e., for $0 < \delta < 2$. There are
two distinguished cases, which lead to two distinct behaviors of the iteration:


(i) If $0 < \delta < 1$, the sequence $(y^n)_n$ converges to 0 keeping a constant sign
(the sign of $y^0$). The sequence $(x^n)$ is an arithmetic progression with the step
equal to the learning rate $\delta$. The iteration stops at the value $n = \left\lfloor \frac{1 - x^0}{\delta} \right\rfloor$,
where $\lfloor x \rfloor$ denotes the floor of $x$, i.e., the largest integer smaller than or equal
to $x$. See Fig. 4.9 a.


(ii) If $1 < \delta < 2$, the sequence $(y^n)_n$ converges to 0 in an oscillatory manner.
This corresponds to the situation when the iteration ascends the canyon walls
overshooting the bottom of the canyon, see Fig. 4.9 b.
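
The two regimes can be reproduced with a short simulation. The sketch below is illustrative only: the starting values and the function names `run_example` and `y_sequence` are assumptions. The first function iterates while the next $x$ value still lies in the domain $(0, 1)$; the second one tracks only the $y$ component, so that the sign behavior for $\delta$ on either side of 1 can be inspected without the domain restriction.

```python
import math

# Sketch of the iteration from Example 4.2.1 for f(x, y) = y^2 / 2 - x.
# Starting values and function names are illustrative assumptions.

def run_example(x0, y0, delta):
    """Apply x <- x + delta, y <- (1 - delta) y while x + delta stays below 1."""
    points = [(x0, y0)]
    x, y = x0, y0
    while x + delta < 1.0:
        x, y = x + delta, (1.0 - delta) * y
        points.append((x, y))
    return points

def y_sequence(y0, delta, n):
    """First n + 1 terms of y^n = (1 - delta)^n y^0, computed recursively."""
    ys = [y0]
    for _ in range(n):
        ys.append((1.0 - delta) * ys[-1])
    return ys

points = run_example(0.1, 1.0, 0.25)
print(len(points) - 1, math.floor((1.0 - 0.1) / 0.25))
print(y_sequence(1.0, 0.5, 6))   # constant sign
print(y_sequence(1.0, 1.5, 6))   # alternating sign
```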


4.2.4 Line Search Method 

This is a variant of the method of steepest descent with an adjustable learning
rate $\eta$. The rate is chosen as follows. Starting from an initial point
$x^0$, consider the normal direction on the level hypersurface $S_{f(x^0)}$ given by
the gradient $\nabla f(x^0)$. We need to choose a point, $x^1$, on the line given by
this direction at which the objective function $f$ reaches a minimum. This is
equivalent to choosing the value $\eta_0 > 0$ such that

$$\eta_0 = \arg\min_{\eta} f(x^0 - \eta \nabla f(x^0)). \tag{4.2.5}$$











The procedure continues with the next starting point $x^1 = x^0 - \eta_0 \nabla f(x^0)$.
By an iteration we obtain the sequence of points $(x^n)$ and the sequence of
learning rates $(\eta_n)$ defined recursively by

$$\begin{aligned}
\eta_n &= \arg\min_{\eta} f(x^n - \eta \nabla f(x^n)) \\
x^{n+1} &= x^n - \eta_n \nabla f(x^n).
\end{aligned}$$

The method of line search just described has the following geometric
significance. Consider the function

$$g(\eta) = f(x^0 - \eta \nabla f(x^0)),$$

and differentiate it to get

$$g'(\eta) = -\langle \nabla f(x^0 - \eta \nabla f(x^0)), \nabla f(x^0) \rangle. \tag{4.2.6}$$

If $\eta_0$ is chosen to realize the minimum (4.2.5), then $g'(\eta_0) = 0$, which implies
via (4.2.6) that $\nabla f(x^1)$ and $\nabla f(x^0)$ are normal vectors. This occurs when the
point $x^1$ is obtained as the tangent contact between the line $\{x^0 - \eta \nabla f(x^0)\}$
and the level hypersurface $S_{f(x^1)}$, see Fig. 4.10.

In general, the algorithm continues as follows: consider the normal
line at $x^n$ to the hypersurface $S_{f(x^n)}$ and pick $x^{n+1}$ to be the point where this
line is tangent to a level surface. This algorithm produces a sequence which
converges to $x^*$ much faster than in the case of the steepest descent method.
Note that the polygonal line is infinite and has right angles.
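
The right-angle property can be observed on a quadratic objective, for which the minimization over $\eta$ has a closed form. The sketch below assumes $f(x, y) = \frac{1}{2}(x^2 + 10y^2)$, i.e., $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 10)$; in that case the exact step along $-g$, with $g = \nabla f(x)$, is $\eta = \frac{g^T g}{g^T A g}$. The objective and all names are illustrative, and a general $f$ would need a numerical one-dimensional minimizer instead of this formula.

```python
# Sketch of exact line search on the quadratic f(x, y) = (x^2 + 10 y^2) / 2.
# The objective, the closed-form step and the names are assumptions.

def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return (x[0], 10.0 * x[1])

def exact_step(g):
    """argmin over eta of f(x - eta g) for this quadratic: (g.g) / (g.A g)."""
    return (g[0] ** 2 + g[1] ** 2) / (g[0] ** 2 + 10.0 * g[1] ** 2)

def inner(a, b):
    return a[0] * b[0] + a[1] * b[1]

def line_search_descent(x0, n_steps):
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        g = grad_f(x)
        if g == (0.0, 0.0):          # critical point reached
            break
        eta = exact_step(g)
        xs.append((x[0] - eta * g[0], x[1] - eta * g[1]))
    return xs

xs = line_search_descent((10.0, 1.0), 10)
for a, b in zip(xs, xs[1:]):
    print(f(b), inner(grad_f(a), grad_f(b)))
```

Each printed inner product of consecutive gradients is zero up to rounding, which is the numerical counterpart of the right angles in the polygonal line.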

Before getting any further, we shall provide some examples. 

Example 4.2.6 Consider the objective function $f(x) = \frac{1}{2}(ax - b)^2$ of a real
variable $x \in \mathbb{R}$, with real coefficients $a \ne 0$ and $b$. It is obvious that its
minimum is given by the exact formula $x^* = b/a$. We shall show that we
arrive at the same expression by applying the steepest descent method.

The gradient in this case is just the derivative $f'(x) = a^2 x - ab$. Starting
from an initial point $x^0$, we construct the approximation sequence

$$x^{n+1} = x^n - \delta f'(x^n) = (1 - \delta a^2) x^n + \delta ab.$$

Denote $\alpha = 1 - \delta a^2$ and $\beta = \delta ab$. The linear recurrence $x^{n+1} = \alpha x^n + \beta$ can
be solved explicitly in terms of $x^0$ as

$$x^n = \alpha^n x^0 + \frac{1 - \alpha^n}{1 - \alpha} \beta.$$

For $\delta$ small enough, $0 < \delta < \frac{1}{a^2}$, we have $0 < \alpha < 1$, which implies that $\alpha^n \to 0$,
as $n \to \infty$, and hence the sequence $x^n$ is convergent with the aforementioned
desired limit

$$x^* = \lim_{n \to \infty} x^n = \frac{\beta}{1 - \alpha} = \frac{b}{a}.$$
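
The recurrence and its closed-form solution can be compared directly. In the sketch below the coefficients $a = 2$, $b = 3$, the starting point $x^0 = 5$ and the rate $\delta = 0.1$ (which satisfies $0 < \delta < 1/a^2$) are illustrative values, and the function names are hypothetical.

```python
# Sketch for Example 4.2.6: f(x) = (a x - b)^2 / 2 with derivative a^2 x - a b.
# The numerical values and the function names are illustrative assumptions.

def descend(a, b, x0, delta, n):
    """Apply x <- x - delta * f'(x) exactly n times."""
    x = x0
    for _ in range(n):
        x = x - delta * (a * a * x - a * b)
    return x

def closed_form(a, b, x0, delta, n):
    """x^n = alpha^n x^0 + (1 - alpha^n) / (1 - alpha) * beta."""
    alpha = 1.0 - delta * a * a
    beta = delta * a * b
    return alpha ** n * x0 + (1.0 - alpha ** n) / (1.0 - alpha) * beta

a, b, x0, delta = 2.0, 3.0, 5.0, 0.1
for n in (1, 5, 50, 200):
    print(n, descend(a, b, x0, delta, n), closed_form(a, b, x0, delta, n))
print(b / a)
```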








Figure 4.10: The approximation of the minimum $x^*$ by the method of line
search.


Example 4.2.7 This is an extension of the previous example to several
variables. Consider the function $f : \mathbb{R}^k \to \mathbb{R}$ defined by $f(x) = \frac{1}{2}\|Ax - b\|^2$,
where $b \in \mathbb{R}^m$ and $A$ is an $m \times k$ matrix of rank $k$, and $\| \cdot \|$ denotes the
Euclidean norm.

We note that the minimum $x^*$ satisfies the linear system $Ax^* = b$, since
$f(x) \ge 0$ and $f(x^*) = 0$. Multiplying on the left by $A^T$ and then inverting,
we obtain the exact form solution $x^* = (A^T A)^{-1} A^T b$. The existence of the
previous inverse is provided by the fact that the $k \times k$ square matrix $A^T A$ has
maximum rank, since $\mathrm{rank}(A^T A) = \mathrm{rank}\, A = k$. The solution $x^*$ is called
the Moore-Penrose pseudoinverse and some of its algebraic and geometric
significance can be found in section G.2 of the Appendix.

We shall apply next the method of steepest descent and show that we
obtain the same previously stated exact solution, i.e., the Moore-Penrose
pseudoinverse. Since

$$\begin{aligned}
f(x) &= \frac{1}{2}\|Ax - b\|^2 = \frac{1}{2}(Ax - b)^T (Ax - b) \\
&= \frac{1}{2}\Big[ \langle A^T A x, x \rangle - 2 \langle A^T b, x \rangle + \|b\|^2 \Big],
\end{aligned}$$

its gradient is given by $\nabla f(x) = A^T A x - A^T b$. The approximation sequence
can be written as












$$\begin{aligned}
x^{n+1} &= x^n - \delta \nabla f(x^n) \\
&= x^n - \delta (A^T A x^n - A^T b) \\
&= (I_k - \delta A^T A) x^n + \delta A^T b \\
&= M x^n + \delta A^T b,
\end{aligned}$$

where $M = I_k - \delta A^T A$, and $I_k$ denotes the identity matrix. Since the matrix
$I_k - M = \delta A^T A$ is invertible, an iteration provides

$$\begin{aligned}
x^n &= M^n x^0 + (M^{n-1} + M^{n-2} + \cdots + M + I_k)\, \delta A^T b \\
&= M^n x^0 + (I_k - M^n)(I_k - M)^{-1} \delta A^T b \\
&= M^n \big( x^0 - (A^T A)^{-1} A^T b \big) + (A^T A)^{-1} A^T b.
\end{aligned}$$

This implies the limit

$$x^* = \lim_{n \to \infty} x^n = (A^T A)^{-1} A^T b,$$

provided we are able to show $\lim_{n \to \infty} M^n = 0$. For this to occur, it suffices to
show that all the eigenvalues $\{\lambda_i\}$ of $M$ are real and bounded, with $|\lambda_i| < 1$.
This can be shown as follows. Since $M = M^T$, all its eigenvalues are
real and we have the decomposition $M = V D V^{-1}$, with $D$ a diagonal matrix,
having the eigenvalues $\{\lambda_i\}$ along the diagonal. The $n$th power of $M$, given
by $M^n = V D^n V^{-1}$, converges to the zero matrix, provided $D^n \to 0$, which
occurs if $|\lambda_i| < 1$. We shall show that $\delta$ can be tuned such that this condition
holds for all eigenvalues. This will be done in two steps:

Step 1. Show that $\lambda_i < 1$.

As an eigenvalue of $M$, $\lambda_i$ satisfies the equation
$\det(M - \lambda_iI_k) = 0$. Substituting for $M$, this becomes
$\det\big((1 - \lambda_i)I_k - \delta A^TA\big) = 0$, which is equivalent to
$\det\big(A^TA - \frac{1 - \lambda_i}{\delta}I_k\big) = 0$. This means that
$\eta_i = \frac{1 - \lambda_i}{\delta}$ is an eigenvalue of the matrix $A^TA$.
Since $A^TA$ is positive definite and nondegenerate, it follows that
$\eta_i > 0$, which implies that $\lambda_i < 1$.

Step 2. Show that for $\delta$ small enough we have $\lambda_i > 0$.

The operator $F : \mathbb{R}^k \to \mathbb{R}^m$, defined by $F(w) = Aw$, is
linear and continuous, and hence, it is bounded. Therefore, there is a
constant $K > 0$ such that $\|Aw\| \le K\|w\|$, for all $w \in \mathbb{R}^k$.
Choosing $\delta < \frac{1}{K^2}$, we have

$$\delta\|Aw\|^2 < \|w\|^2, \quad \forall w \in \mathbb{R}^k,\ w \neq 0.$$

This can be written as $\delta w^TA^TAw < w^Tw$, or equivalently,
$w^T(I_k - \delta A^TA)w > 0$. This means $\langle Mw, w\rangle > 0$ for all
nonzero $w \in \mathbb{R}^k$, i.e., the matrix $M$ is positive definite, and
hence $\lambda_i > 0$.
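The convergence argument of Example 4.2.7 can be checked numerically. The following is a minimal sketch in plain Python, not part of the original text: the matrix `A`, the vector `b`, the helper functions, and the choice of the step size as the reciprocal of the trace of $A^TA$ (a crude upper bound for $K^2$) are all illustrative assumptions.

```python
# Steepest descent for f(x) = 0.5 * ||Ax - b||^2, compared with the
# closed-form solution x* = (A^T A)^{-1} A^T b. Pure Python, k = 2 unknowns.

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(P, Q):
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]

# Hypothetical data: A is 3 x 2 with rank 2, b is in R^3.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]

At = transpose(A)
AtA = matmul(At, A)      # 2 x 2 and positive definite since rank A = 2
Atb = matvec(At, b)

# Closed form, using the explicit inverse of the 2 x 2 matrix A^T A.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
inv = [[AtA[1][1] / det, -AtA[0][1] / det],
       [-AtA[1][0] / det, AtA[0][0] / det]]
x_star = matvec(inv, Atb)

# Step size delta < 1/K^2. The trace of A^T A bounds its largest
# eigenvalue from above, so 1/trace is a safe (assumed) choice.
delta = 1.0 / (AtA[0][0] + AtA[1][1])

x = [0.0, 0.0]           # initial guess x^0
for _ in range(2000):
    grad = [g - c for g, c in zip(matvec(AtA, x), Atb)]  # A^T A x - A^T b
    x = [xi - delta * gi for xi, gi in zip(x, grad)]

print("steepest descent:", x)
print("closed form     :", x_star)
```

For this particular data the system $Ax = b$ is inconsistent, so the limit is the least-squares solution rather than an exact solution of the system.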






Figure 4.11: A ball rolling downhill without friction and its trajectory repre- 
sentation in the phase space. 


4.3 Kinematic Interpretation 


This section deals with the kinematic interpretation of the method of steepest
descent. Consider a ball of mass $m = 1$ which rolls downhill toward the
bottom of a valley, without friction. The state of the ball is described by
the pair $s = (x, v)$, where $x$ and $v$ denote the coordinate and the velocity
of the ball, respectively. The space of states $s$ is called the phase space.
The dynamics of the ball can be traced in the phase space as a curve
$s(t) = (x(t), v(t))$, where $x(t)$ is its coordinate and $v(t) = \dot{x}(t)$
is its velocity at time $t$, see Fig. 4.11.

The geometry of the valley is modeled by a convex function $z = f(x)$. We
consider two types of energy acting on the ball:

• the kinetic energy due to movement: $E_k = \frac{1}{2}\|v\|^2$;

• the potential energy due to altitude: $E_p = f(x)$.

The ball's dynamics is described by the classical Lagrangian, which is the
difference between the kinetic and potential energies:

$$L(x, \dot{x}) = E_k - E_p = \frac{1}{2}\|\dot{x}\|^2 - f(x).$$

The equation of motion is given by the Euler-Lagrange variational equation
$\frac{d}{dt}\Big(\frac{\partial L}{\partial \dot{x}}\Big) = \frac{\partial L}{\partial x}$,
which in the present case becomes

$$\ddot{x}(t) = -\nabla f(x(t)). \tag{4.3.7}$$

This can be regarded as the Newton law of motion of a unit mass ball under
a force derived from the potential $f(x)$.


























The total energy of the ball at time $t$ is defined as the sum of the
kinetic and potential energies

$$E_{tot}(t) = E_k(t) + E_p(t) = \frac{1}{2}\|\dot{x}(t)\|^2 + f(x(t)). \tag{4.3.8}$$

Applying (4.3.7) yields

$$\begin{aligned}
\frac{d}{dt}E_{tot}(t) &= \frac{d}{dt}\Big(\frac{1}{2}\dot{x}(t)^T\dot{x}(t) + f(x(t))\Big) \\
&= \dot{x}(t)^T\ddot{x}(t) + \dot{x}(t)^T\nabla f(x(t)) \\
&= \dot{x}(t)^T\big(\ddot{x}(t) + \nabla f(x(t))\big) = 0,
\end{aligned}$$

i.e., the total energy is preserved along the trajectory. We can use this
observation to construct level sets as follows. Consider a function defined
on the phase space by

$$H(x, v) = \frac{1}{2}\|v\|^2 + f(x), \tag{4.3.9}$$

called the Hamiltonian function. Define the energy levels given by the
hypersurfaces $S_c$ associated with the function $H$ by

$$S_c = H^{-1}(\{c\}) = \{(x, v);\ H(x, v) = c\}.$$

The previous computation shows that any solution trajectory in the phase
space, $(x(t), v(t))$, belongs to one of these level hypersurfaces.

An equivalent way to look at equation (4.3.7) is to write it as a first-order
system of ODEs

$$\dot{x}(t) = v(t) \tag{4.3.10}$$
$$\dot{v}(t) = -\nabla f(x(t)). \tag{4.3.11}$$

Given some initial conditions, $x(0) = x^0$, $v(0) = v^0$, standard ODE results
provide the existence of a unique local solution $(x(t), v(t))$ starting at the
point $(x^0, v^0)$.

In order to find the minimum of the potential function $z = f(x)$, it
suffices to find the stable equilibrium point of the ball, which is achieved at
the bottom of the valley. If the ball is placed at this point, with zero
velocity, it will stay there forever. This point is an equilibrium point for
the ODE system (4.3.10)-(4.3.11) and corresponds in the phase space to an
energy level which is degenerated to a point. Setting $\dot{x}(t) = 0$ and
$\dot{v}(t) = 0$ in the previous ODE system provides the equilibrium point
$(x^*, v^*)$ given by

$$v^* = 0, \qquad \nabla f(x^*) = 0. \tag{4.3.12}$$











Finding the equilibrium state $s^* = (x^*, v^*)$ can be achieved by applying
the method of steepest descent to the energy level hypersurfaces $S_c$ in the
phase space. Theorem 4.2.3 provides the existence of a steepest descent curve
joining the initial state of the ball $s_0 = (x^0, v^0)$ (initial position and
velocity) to the equilibrium state $s^* = (x^*, v^*)$, provided $s_0$ is close
enough to $s^*$.⁴ The iteration (4.2.4) becomes

• Set an initial state $s_0 = (x^0, v^0)$;

• Construct recursively the sequence of states

$$s_{n+1} = s_n - \delta\nabla_sH(s_n), \quad \forall n \ge 0. \tag{4.3.13}$$

Using the expression of the Hamiltonian function (4.3.9), its gradient becomes

$$\nabla_sH(s) = \Big(\frac{\partial H}{\partial x}, \frac{\partial H}{\partial v}\Big) = \big(\nabla f(x), v\big).$$

Hence, the expression (4.3.13) can now be written on components as

$$x^{n+1} = x^n - \delta\nabla f(x^n)$$
$$v^{n+1} = (1 - \delta)v^n.$$

We have obtained two separated equations, which can be solved independently.
The second one has the closed-form solution $v^n = (1 - \delta)^nv^0$, and
for $\delta$ small, we have $v^n \to v^* = 0$ as $n \to \infty$. The first
equation is nothing but the iteration (4.2.4). Hence, besides a nice kinematic
interpretation, this approach does not provide any improvement over the method
described in section 4.2.3. In order to make an improvement we need to
introduce a friction term, which will be done in the next section. The
qualitative difference between these two approaches will be discussed in the
following.
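The decoupling of the position and velocity equations can be illustrated with a short sketch. It assumes the hypothetical potential $f(x) = \frac{1}{2}x^2$ and arbitrary illustrative values for the step size and the initial state; none of these choices come from the text.

```python
# Steepest descent on H(x, v) = 0.5 * v**2 + f(x) in the phase space,
# with the assumed potential f(x) = 0.5 * x**2 (so grad f(x) = x).

def grad_f(x):
    return x

delta = 0.1            # step size (illustrative value)
x0, v0 = 2.0, 1.5      # initial state s_0 = (x^0, v^0) (illustrative)
n_steps = 50

x, v = x0, v0
for _ in range(n_steps):
    # grad_s H(s) = (grad f(x), v), so the two components never interact
    x, v = x - delta * grad_f(x), (1.0 - delta) * v

v_closed_form = (1.0 - delta) ** n_steps * v0
print("v after iteration:", v)
print("closed form      :", v_closed_form)
```

The position component is updated exactly as in plain gradient descent, while the velocity component merely decays geometrically and carries no information about $f$.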

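The solution flow, $(x(t), v(t))$, satisfies the system (4.3.10)-(4.3.11). This
can be written in terms of the Hamiltonian function, equivalently, as

$$\dot{x}(t) = \frac{\partial H}{\partial v} \tag{4.3.14}$$
$$\dot{v}(t) = -\frac{\partial H}{\partial x}. \tag{4.3.15}$$

The tangent vector field to the solution flow is defined by
$X = \dot{x}(t)\frac{\partial}{\partial x} + \dot{v}(t)\frac{\partial}{\partial v}$.
Using (4.3.14)-(4.3.15) we compute its divergence

$$\operatorname{div} X = \frac{\partial}{\partial x}\frac{\partial H}{\partial v}
- \frac{\partial}{\partial v}\frac{\partial H}{\partial x} = 0.$$

4 This proximity condition can be waived if $f$ is a convex function.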












Since the divergence of a vector field represents the rate at which the volume
evolves along the flow curves, the previous relation can be interpreted by
saying that the solution flow is incompressible, i.e., any given volume of
particles preserves its volume during the evolution of the dynamical system,
see Fig. 4.12 a. In fact, all Hamiltonian flows (solutions of the system
(4.3.14)-(4.3.15) for any smooth function $H$) have zero divergence.
Consequently, if the ball starts rolling from some initial neighboring states,
then at any time during the system evolution the states are in the same volume
proximity, without the possibility of converging to any equilibrium state. A
ball which rolls downhill without friction in a convex cup will never stop at
the bottom of the cup; it will continue to bounce back and forth on the cup
wall passing, without stopping, near the equilibrium point infinitely many
times.
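The frictionless behavior described above can be observed numerically. The sketch below integrates the system (4.3.10)-(4.3.11) for the assumed potential $f(x) = \frac{1}{2}x^2$ with a velocity-Verlet scheme; the integrator, the time step, and the initial state are illustrative choices that do not appear in the text.

```python
# Frictionless ball in the convex cup f(x) = 0.5 * x**2 (assumed potential).
# A velocity-Verlet integrator is used because it keeps the discrete
# energy close to the exact one over long times.

def grad_f(x):
    return x

def hamiltonian(x, v):
    return 0.5 * v * v + 0.5 * x * x

def simulate(x, v, dt, n_steps):
    energies = []
    crossings = 0                      # passages through the bottom x* = 0
    for _ in range(n_steps):
        v_half = v - 0.5 * dt * grad_f(x)
        x_new = x + dt * v_half
        v = v_half - 0.5 * dt * grad_f(x_new)
        if x * x_new < 0.0:
            crossings += 1
        x = x_new
        energies.append(hamiltonian(x, v))
    return x, v, energies, crossings

x_end, v_end, energies, crossings = simulate(1.0, 0.0, dt=0.01, n_steps=10000)
print("energy range:", min(energies), max(energies))
print("passages through the equilibrium point:", crossings)
```

In this run the energy stays near its initial value $0.5$ and the ball keeps crossing the bottom of the cup, which matches the incompressibility picture: the trajectory remains on its energy level and never settles.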


4.4 Momentum Method 

In order to avoid getting stuck in a local minimum of the cost function, several
methods have been designed. The basic idea is that shaking the system by
adding extra velocity or energy will make the system pass over the energy
barrier and move into a state of lower energy.⁵

We have seen that the gradient descent method can be understood by
considering the physical model of a ball rolling down into a cup. The position
of the ball is updated at all times in the negative direction of the gradient
by a given step, which is the learning rate.

The momentum method modifies the gradient descent by introducing a
velocity variable and having the gradient modify the velocity rather than the
position. It is the change in velocity that will affect the position. Besides
the learning rate, this technique uses an extra hyperparameter, which models
the friction. The friction gradually reduces the velocity and has the ball
rolling toward a stable equilibrium, which is the bottom of the cup. The role
of this method is to accelerate the gradient descent method while performing
the minimization of the cost function.

The classical momentum method (Polyak, 1964, see [98]) provides the
following simultaneous updates for the position and velocity

$$x^{n+1} = x^n + v^{n+1} \tag{4.4.16}$$
$$v^{n+1} = \mu v^n - \eta\nabla f(x^n), \tag{4.4.17}$$

where $\eta > 0$ is the learning rate and $\mu \in (0, 1]$ is the momentum
coefficient.

5 For instance, shaking a basket filled with potatoes of different sizes will bring the large 
ones to the bottom of the basket and the small ones to the top - this corresponds to the 
state of the system with the smallest gravitational energy. 





It is worth noting that for $\mu \to 0$, the previous model recovers the
familiar model of the gradient descent, $x^{n+1} = x^n - \eta\nabla f(x^n)$.
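The updates (4.4.16)-(4.4.17) translate directly into code. The following is a minimal sketch for a scalar variable; the function name `momentum_descent`, the test function $f(x) = 2(x - 3)^2$, and the hyperparameter values are illustrative assumptions rather than anything prescribed by the text.

```python
# Classical momentum method, equations (4.4.16)-(4.4.17), scalar case.

def momentum_descent(grad_f, x0, eta, mu, n_steps, v0=0.0):
    x, v = x0, v0
    for _ in range(n_steps):
        v = mu * v - eta * grad_f(x)   # (4.4.17): the gradient changes the velocity
        x = x + v                      # (4.4.16): the new velocity moves the position
    return x, v

def grad_f(x):
    # gradient of the assumed cost f(x) = 2 * (x - 3)**2, minimized at x = 3
    return 4.0 * (x - 3.0)

x_momentum, _ = momentum_descent(grad_f, x0=0.0, eta=0.05, mu=0.8, n_steps=200)
x_plain, _ = momentum_descent(grad_f, x0=0.0, eta=0.05, mu=0.0, n_steps=200)

print("momentum (mu = 0.8):", x_momentum)
print("plain GD (mu = 0)  :", x_plain)
```

Setting `mu=0.0` reproduces plain gradient descent, in line with the limit $\mu \to 0$ noted above.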

4.4.1 Kinematic Interpretation 

We picture again the model of a ball rolling into a cup, whose equation is
given by $y = f(x)$, where $f$ is the objective function subject to
minimization. We shall denote by $F_f$ the friction force between the ball and
the cup walls. This force is proportional to speed and has its direction
opposite to velocity, $F_f = -\rho\dot{x}(t)$, for some damping coefficient
$\rho > 0$. Therefore, Newton's law of motion is written as

$$\ddot{x}(t) = -\rho\dot{x}(t) - \nabla f(x(t)). \tag{4.4.18}$$

The left side is the acceleration of a unit mass ball and the right side is the
total force acting on the ball, which is the sum of the friction force and
the force provided by the potential $f$. This equation can be used to show that
in this case the ball's total energy (4.3.8) is not preserved along the
solution. Since

$$\begin{aligned}
\frac{d}{dt}\Big(\frac{1}{2}\dot{x}(t)^T\dot{x}(t) + f(x(t))\Big)
&= \dot{x}(t)^T\ddot{x}(t) + \dot{x}(t)^T\nabla f(x(t)) \\
&= \dot{x}(t)^T\big(\ddot{x}(t) + \nabla f(x(t))\big)
= -\rho\,\dot{x}(t)^T\dot{x}(t) = -\rho\|\dot{x}(t)\|^2 \\
&= -\rho\|v(t)\|^2 \le 0,
\end{aligned}$$

it follows that the total energy decreases at a rate proportional to the square
of the speed. Hence, $E_{tot}(t)$ is a decreasing function, which reaches its
minimum at the equilibrium point of the system.

The equation (4.4.18) can be written equivalently as a first-order ODE
system as

$$\dot{x}(t) = v(t) \tag{4.4.19}$$
$$\dot{v}(t) = -\rho v(t) - \nabla f(x(t)), \tag{4.4.20}$$

where $v(t)$ represents the velocity of the ball at time $t$. The tangent
vector field to the solution flow, $(x(t), v(t))$, has the divergence equal to

$$\operatorname{div}(\dot{x}, \dot{v}) = -\rho < 0.$$

This implies that the solution flow is contracting, the solution trajectories
getting closer together, converging eventually to the equilibrium point. This
point is obtained by equating to zero the right terms of the previous system

$$v = 0 \tag{4.4.21}$$
$$-\rho v - \nabla f(x) = 0. \tag{4.4.22}$$







Figure 4.12: Solution in phase space: a. The divergence of the tangent flow
is zero and the volumes are preserved along the flow. The solution oscillates
forever around the equilibrium point. b. When friction forces are present the
tangent flow shrinks and the trajectory evolves to lower levels of energy. The
oscillation is damped in time.


The solution $(v^*, x^*)$ satisfies $v^* = 0$, $\nabla f(x^*) = 0$, which is
the same equilibrium point as in the frictionless case described by (4.3.12).
The only difference in this case is that, due to energy loss, the solutions in
the phase space move from higher energy levels to lower energy levels,
spiraling down toward the equilibrium point, see Fig. 4.12 b.

In order to obtain an algorithm that can be implemented on a computer,
we shall transform the ODE system (4.4.19)-(4.4.20) into a finite-difference
system. Consider the equidistant time division
$0 = t_0 < t_1 < \cdots < t_n < \infty$ and let $\Delta t = t_{n+1} - t_n$ be
a constant time step. Denote the $n$th state of the system by
$(x^n, v^n) = (x(t_n), v(t_n))$. The system (4.4.19)-(4.4.20) becomes

$$x^{n+1} - x^n = v^n\Delta t$$
$$v^{n+1} - v^n = -\rho v^n\Delta t - \nabla f(x^n)\Delta t.$$

Substituting $\epsilon = \Delta t$ and $\mu = 1 - \rho\epsilon$, with
$\mu < 1$, we obtain the finite-difference system

$$x^{n+1} = x^n + \epsilon v^n \tag{4.4.23}$$
$$v^{n+1} = \mu v^n - \epsilon\nabla f(x^n). \tag{4.4.24}$$

Rescaling the velocity into $\tilde{v} = \epsilon v$ (physically feasible by
changing the units of measure), the system (4.4.23)-(4.4.24) becomes












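$$x^{n+1} = x^n + \tilde{v}^n \tag{4.4.25}$$
$$\tilde{v}^{n+1} = \mu\tilde{v}^n - \eta\nabla f(x^n), \tag{4.4.26}$$

where $\tilde{v} = \epsilon v$ is the rescaled velocity and
$\eta = \epsilon^2$ is the learning rate (the time step is given by
$\Delta t = \sqrt{\eta}$).

Figure 4.13: For small upward shifts the roots $\lambda_i$ remain real and
situated between $\mu$ and 1.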
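The energy loss caused by friction can be seen by running the finite-difference system (4.4.23)-(4.4.24) directly. The sketch below assumes the potential $f(x) = \frac{1}{2}x^2$ and illustrative values for the damping coefficient `rho` and the time step `eps`, chosen so that this explicit scheme is stable for the assumed potential.

```python
# Finite-difference system (4.4.23)-(4.4.24) for the damped ball,
# with the assumed potential f(x) = 0.5 * x**2.

def grad_f(x):
    return x

def total_energy(x, v):
    return 0.5 * v * v + 0.5 * x * x

rho = 0.5                 # damping coefficient (illustrative)
eps = 0.01                # time step, eps = Delta t (illustrative)
mu = 1.0 - rho * eps      # momentum coefficient, mu = 1 - rho * eps

x, v = 1.0, 0.0
energy_start = total_energy(x, v)
for _ in range(4000):
    # simultaneous update: both right-hand sides use the old (x^n, v^n)
    x, v = x + eps * v, mu * v - eps * grad_f(x)
energy_end = total_energy(x, v)

print("initial energy:", energy_start)
print("final energy  :", energy_end)
print("final state   :", (x, v))
```

In contrast with the frictionless case, the state spirals toward the equilibrium $(x^*, v^*) = (0, 0)$ and the total energy drops by many orders of magnitude.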

We notice the velocity index difference between equation (4.4.25) and
equation (4.4.16) of Polyak's classical momentum method. This can be fixed by
replacing in our analysis the backward equation
$x^{n+1} - x^n = v^n\Delta t$ by the forward equation
$x^{n+1} - x^n = v^{n+1}\Delta t$.

Note that when the function $f$ is quadratic, its gradient, $\nabla f$, is
linear and hence the equations (4.4.23)-(4.4.24) form a linear system that can
be solved explicitly. We shall do this in the next example.

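Example 4.4.1 Consider the quadratic function of a real variable
$f(x) = \frac{1}{2}(ax - b)^2$. Since the gradient is $f'(x) = a(ax - b)$, the
system (4.4.23)-(4.4.24) is linear:

$$x^{n+1} = x^n + \epsilon v^n$$
$$v^{n+1} = -\epsilon a^2x^n + \mu v^n + \epsilon ab.$$

In matrix notation this writes more simply as $s_{n+1} = Ms_n + \beta$, with

$$s_n = \begin{pmatrix} x^n \\ v^n \end{pmatrix}, \qquad
M = \begin{pmatrix} 1 & \epsilon \\ -\epsilon a^2 & \mu \end{pmatrix}, \qquad
\beta = \begin{pmatrix} 0 \\ \epsilon ab \end{pmatrix}.$$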







Inductively, we can express the state $s_n$ in terms of the initial state
$s_0$ as

$$s_n = M^ns_0 + (I_2 - M^n)(I_2 - M)^{-1}\beta. \tag{4.4.27}$$

A computation provides the inverse

$$(I_2 - M)^{-1} = \frac{1}{\epsilon^2a^2}
\begin{pmatrix} 1 - \mu & \epsilon \\ -\epsilon a^2 & 0 \end{pmatrix}.$$

Next we shall show that the eigenvalues of $M$ are between 0 and 1. To denote
the dependence on $\epsilon$, we write $M = M(\epsilon)$. Then

$$M(0) = \begin{pmatrix} 1 & 0 \\ 0 & \mu \end{pmatrix}$$

is a diagonal matrix having eigenvalues $\lambda_1(0) = \mu$ and
$\lambda_2(0) = 1$. The characteristic equation of $M$ is the quadratic
equation

$$\lambda^2 - (\mu + 1)\lambda + (\mu + \epsilon^2a^2) = 0,$$

with solutions $\lambda_1(\epsilon)$ and $\lambda_2(\epsilon)$. Instead of
computing these eigenvalues explicitly, we shall rather show that they are
between 0 and 1 in the following qualitative way. Note that the characteristic
equation depends additively on $\epsilon$. When $\epsilon = 0$, there are two
intersections between the parabola and the $\lambda$-axis, given by
$\lambda_1(0) = \mu$ and $\lambda_2(0) = 1$. For small enough $\epsilon > 0$,
the parabola shifts up by the small amount $\epsilon^2a^2$, and by continuity
reasons, there are still two intersections with the $\lambda$-axis, situated
in between $\mu$ and 1, see Fig. 4.13. Hence,
$0 < \mu < \lambda_1(\epsilon) < \lambda_2(\epsilon) < 1$. Consequently,
$M^n \to 0$, as $n \to \infty$, by Proposition G.1.2 of the Appendix.


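Taking the limit in relation (4.4.27) yields the equilibrium state

$$s^* = \lim_{n\to\infty} s_n = (I_2 - M)^{-1}\beta
= \frac{1}{\epsilon^2a^2}
\begin{pmatrix} 1 - \mu & \epsilon \\ -\epsilon a^2 & 0 \end{pmatrix}
\begin{pmatrix} 0 \\ \epsilon ab \end{pmatrix}
= \begin{pmatrix} b/a \\ 0 \end{pmatrix},$$

which retrieves the minimum $x^* = \frac{b}{a}$, as expected.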
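Example 4.4.1 can be verified numerically. The sketch below uses illustrative values of $a$, $b$, $\epsilon$, and $\mu$ (assumptions, not values from the text), picked so that $2\epsilon a < 1 - \mu$; this makes the discriminant of the characteristic equation positive, which is the "small enough $\epsilon$" regime in which both roots are real.

```python
import math

# Linear momentum system of Example 4.4.1 for f(x) = 0.5 * (a*x - b)**2.
a, b = 2.0, 3.0           # illustrative coefficients, minimum at b/a = 1.5
eps, mu = 0.01, 0.9       # illustrative; 2*eps*a = 0.04 < 1 - mu = 0.1

# Roots of lambda^2 - (mu + 1)*lambda + (mu + eps^2 * a^2) = 0.
trace = mu + 1.0
det = mu + eps ** 2 * a ** 2
disc = trace ** 2 - 4.0 * det
lam1 = (trace - math.sqrt(disc)) / 2.0
lam2 = (trace + math.sqrt(disc)) / 2.0

# Iterate s_{n+1} = M s_n + beta starting from s_0 = (0, 0).
x, v = 0.0, 0.0
for _ in range(6000):
    x, v = x + eps * v, -eps * a * a * x + mu * v + eps * a * b

print("eigenvalues:", lam1, lam2)
print("limit state:", (x, v), " expected:", (b / a, 0.0))
```

Because $\lambda_2(\epsilon)$ lies very close to 1 for small $\epsilon$, several thousand iterations are used here before the state is compared with $(b/a, 0)$.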

Remark 4.4.2 The name of “momentum method” is used in an improper
way here. A momentum given to a particle is usually a push forward, while in
this case we damp the particle using a friction force. We do this in order to
keep the particle from overshooting the equilibrium point.

However, sometimes we would like the opposite effect: to make the particle
avoid getting stuck in a local minimum. In this case, the friction factor is
replaced by a momentum factor meant to give a boost to the particle velocity
so that it overshoots the local minimum. This is easily fulfilled by imposing
the condition $\mu > 1$ in equation (4.4.24).






4.4.2 Convergence conditions 


In this section we are interested in studying the convergence of the sequences
$x^n$ and $v^n$ defined by the momentum method equations (4.4.16)-(4.4.17). To
accomplish this task we shall find exact formulas for the sequences.

Iterating equation (4.4.16)

$$\begin{aligned}
x^n &= x^{n-1} + v^n \\
x^{n-1} &= x^{n-2} + v^{n-1} \\
&\ \ \vdots \\
x^1 &= x^0 + v^1
\end{aligned}$$

and then adding, yields the expression of the position in terms of velocities

$$x^n = x^0 + \sum_{k=1}^n v^k. \tag{4.4.28}$$

Denote for simplicity $b_n = \nabla f(x^n)$. Iterating equation (4.4.17), we
obtain

$$\begin{aligned}
v^n &= \mu v^{n-1} - \eta b_{n-1} \\
&= \mu(\mu v^{n-2} - \eta b_{n-2}) - \eta b_{n-1} \\
&= \mu^2v^{n-2} - \mu\eta b_{n-2} - \eta b_{n-1} \\
&= \mu^2(\mu v^{n-3} - \eta b_{n-3}) - \mu\eta b_{n-2} - \eta b_{n-1} \\
&= \mu^3v^{n-3} - \mu^2\eta b_{n-3} - \mu\eta b_{n-2} - \eta b_{n-1}.
\end{aligned}$$

It can be shown by induction that

$$v^n = \mu^nv^0 - \eta\big(\mu^{n-1}b_0 + \mu^{n-2}b_1 + \cdots + \mu b_{n-2} + b_{n-1}\big).$$

In the following it is more convenient to shift indices and use the summation
convention to obtain the expression

$$v^{n+1} = \mu^{n+1}v^0 - \eta\sum_{i=0}^n \mu^{n-i}b_i. \tag{4.4.29}$$

In order to understand the behavior of $v^{n+1}$ we need to introduce the
following notion.

Definition 4.4.3 The convolution series of two numerical series
$\sum_{n\ge0} a_n$ and $\sum_{n\ge0} b_n$ is the series $\sum_{n\ge0} c_n$
with the general term
$c_n = \sum_{i=0}^n a_ib_{n-i} = \sum_{i=0}^n a_{n-i}b_i$.




The following two results will be used in the convergence analysis of the
sequences $(x^n)_n$ and $(v^n)_n$.

Proposition 4.4.4 Let $(a_n)_n$ be a sequence of real numbers convergent to 0
and $\sum_{n\ge0} b_n$ be an absolutely convergent numerical series. Then

$$\lim_{n\to\infty}\sum_{i=0}^n a_ib_{n-i} = 0.$$

Theorem 4.4.5 Consider two numerical series $\sum_{n\ge0} a_n$ and
$\sum_{n\ge0} b_n$, one convergent and the other absolutely convergent. Then
their convolution series is convergent and its sum is equal to the product of
the sums of the given series, i.e.

$$\sum_{n\ge0}\Big(\sum_{i=0}^n a_ib_{n-i}\Big)
= \Big(\sum_{n\ge0} a_n\Big)\Big(\sum_{n\ge0} b_n\Big).$$

The following result provides a characterization for the convergence of the
sequences $(x^n)_n$ and $(v^n)_n$.

Proposition 4.4.6 (a) If $\nabla f(x^n)$ converges to 0, then the sequence
$(v^n)_n$ is also convergent to 0, as $n \to \infty$.

(b) Assume the series $\sum_{n\ge0}\|\nabla f(x^n)\|$ is convergent. Then both
sequences $(x^n)_n$ and $(v^n)_n$ are convergent.

Proof: (a) Recall that $0 < \mu < 1$. Since $(b_n)$ converges to 0 and the
geometric series $\sum_{n\ge0}\mu^n$ is absolutely convergent, Proposition
4.4.4 implies $\lim_{n\to\infty}\sum_{i=0}^n \mu^{n-i}b_i = 0$. Since $\mu^n$
is convergent to 0, taking the limit in (4.4.29) we obtain

$$v^* = \lim_{n\to\infty} v^{n+1}
= v^0\lim_{n\to\infty}\mu^{n+1} - \eta\lim_{n\to\infty}\sum_{i=0}^n \mu^{n-i}b_i = 0.$$

(b) Since $\sum_{n\ge0}\|\nabla f(x^n)\|$ is convergent, $\nabla f(x^n)$
converges to 0. From part (a) it follows that $v^n$ is convergent to zero.

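To show the convergence of $x^n$ we use relations (4.4.28) and (4.4.29) and
manipulate the expressions algebraically as follows:

$$\begin{aligned}
x^{n+1} &= x^0 + \sum_{k=1}^{n+1} v^k = x^0 + \sum_{k=0}^n v^{k+1} \\
&= x^0 + \sum_{k=0}^n\Big(\mu^{k+1}v^0 - \eta\sum_{i=0}^k \mu^{k-i}b_i\Big) \\
&= x^0 + v^0\,\frac{\mu(1 - \mu^{n+1})}{1 - \mu}
- \eta\sum_{k=0}^n\sum_{i=0}^k \mu^{k-i}b_i.
\end{aligned}$$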











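This provides a closed-form expression for $x^{n+1}$. Taking the limit using
Theorem 4.4.5 we obtain

$$\begin{aligned}
x^* = \lim_{n\to\infty} x^{n+1}
&= x^0 + v^0\,\frac{\mu}{1 - \mu}
- \eta\Big(\sum_{n\ge0}\mu^n\Big)\Big(\sum_{n\ge0} b_n\Big) \\
&= x^0 + v^0\,\frac{\mu}{1 - \mu}
- \frac{\eta}{1 - \mu}\sum_{n\ge0}\nabla f(x^n).
\end{aligned}$$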
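Both the closed form (4.4.29) and the limit formula for $x^*$ can be checked against a direct run of the momentum iteration. The sketch below assumes the cost $f(x) = \frac{1}{2}x^2$ and illustrative values for $\eta$, $\mu$, $x^0$, and $v^0$; the variable names are hypothetical.

```python
# Numerical check of (4.4.29) and of the limit formula for x*,
# using the assumed cost f(x) = 0.5 * x**2 (gradient b_n = x^n).

def grad_f(x):
    return x

eta, mu = 0.1, 0.9        # illustrative hyperparameters
x0, v0 = 2.0, 1.0         # illustrative initial state

x, v = x0, v0
grads = []                # b_0, b_1, ..., b_{N-1}
velocities = [v0]         # v^0, v^1, ..., v^N
for _ in range(600):
    b_n = grad_f(x)
    grads.append(b_n)
    v = mu * v - eta * b_n        # (4.4.17)
    x = x + v                     # (4.4.16)
    velocities.append(v)

def v_closed(n):
    """Right-hand side of (4.4.29), i.e. the closed form of v^{n+1}."""
    return mu ** (n + 1) * v0 - eta * sum(
        mu ** (n - i) * grads[i] for i in range(n + 1))

# Limit formula: x* = x^0 + v^0 * mu/(1 - mu) - eta/(1 - mu) * sum_n grad f(x^n)
x_limit = x0 + v0 * mu / (1.0 - mu) - eta / (1.0 - mu) * sum(grads)

print("v^11 iterated:", velocities[11], " closed form:", v_closed(10))
print("x after 600 steps:", x, " limit formula:", x_limit)
```

The partial sum of the gradients is truncated at 600 terms here, so the agreement with the limit formula is only up to the (very small) remaining velocity.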


Remark 4.4.7 An improvement of the classical momentum method has been
proposed by Nesterov [91]. This is obtained by modifying, in the momentum
method, the argument of the gradient; instead of computing it at the current
position $x^n$, it is evaluated at the corrected value $x^n + \mu v^n$:

$$x^{n+1} = x^n + v^{n+1} \tag{4.4.30}$$
$$v^{n+1} = \mu v^n - \eta\nabla f(x^n + \mu v^n). \tag{4.4.31}$$

The Nesterov Accelerated Gradient (abbreviated as NAG) is a first-order
optimization method with better convergence rate than the gradient descent.
Compared with the momentum method, NAG changes the velocity $v$ in a more
responsive way, which makes the method more stable, especially for larger
values of $\mu$.
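The only change with respect to the classical momentum method is the point at which the gradient is evaluated. A minimal scalar sketch of (4.4.30)-(4.4.31) follows; the function name, the test cost $f(x) = 2(x - 3)^2$, and the hyperparameter values are illustrative assumptions.

```python
# Nesterov Accelerated Gradient, equations (4.4.30)-(4.4.31), scalar case.

def nesterov_descent(grad_f, x0, eta, mu, n_steps, v0=0.0):
    x, v = x0, v0
    for _ in range(n_steps):
        lookahead = x + mu * v                 # corrected value x^n + mu * v^n
        v = mu * v - eta * grad_f(lookahead)   # (4.4.31)
        x = x + v                              # (4.4.30)
    return x, v

def grad_f(x):
    # gradient of the assumed cost f(x) = 2 * (x - 3)**2
    return 4.0 * (x - 3.0)

x_nag, v_nag = nesterov_descent(grad_f, x0=0.0, eta=0.05, mu=0.8, n_steps=200)
print("NAG estimate of the minimizer:", x_nag)
```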

4.5 AdaGrad 

A modified stochastic gradient descent method with an adaptive learning
rate was published in 2011 under the name of AdaGrad (Adaptive Gradient),
see Duchi et al. [34]. If $C(x)$ denotes the cost function, which is subject
to minimization, with $x \in \mathbb{R}^N$, then the gradient vector evaluated
at step $t$ is denoted by $g_t = \nabla C(x(t))$. We consider the $N \times N$
matrix

$$G_t = \sum_{\tau=1}^t g_\tau g_\tau^T$$

and consider the update

$$x(t+1) = x(t) - \eta\,G_t^{-1/2}g_t,$$

where $\eta > 0$ is the learning rate. For discrete time steps this can be
written equivalently as

$$x^{n+1} = x^n - \eta\,G_n^{-1/2}g_n.$$









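Since $G_t^{-1/2}$ is computationally impractical in high dimensions, the
update can be done using only the diagonal elements of the matrix

$$x(t+1) = x(t) - \eta\,\operatorname{diag}(G_t)^{-1/2}g_t.$$

The diagonal elements of $G_t$ can be calculated by

$$(G_t)_{jj} = \sum_{\tau=1}^t (g_{\tau,j})^2,$$

where $g_{\tau,j}$ denotes the $j$th component of $g_\tau$ and we use that
$(g_\tau g_\tau^T)_{jj} = (g_{\tau,j})^2$.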
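The diagonal form of the update only needs a running sum of squared gradient components. The sketch below is one possible implementation under stated assumptions: the function name `adagrad`, the separable quadratic test cost, the learning rate, and the small constant `tiny` added to the denominator (to avoid dividing by zero when a component of the gradient has always been zero) are not part of the text.

```python
import math

# Diagonal AdaGrad: x_j <- x_j - eta * g_j / sqrt((G_t)_jj).

def adagrad(grad_C, x0, eta, n_steps, tiny=1e-12):
    x = list(x0)
    G_diag = [0.0] * len(x)                    # running (G_t)_jj
    for _ in range(n_steps):
        g = grad_C(x)
        G_diag = [G + gj * gj for G, gj in zip(G_diag, g)]
        x = [xj - eta * gj / (math.sqrt(G) + tiny)
             for xj, gj, G in zip(x, g, G_diag)]
    return x

def grad_C(x):
    # gradient of the assumed cost C(x) = 0.5*(x_0 - 1)^2 + 5*(x_1 + 2)^2
    return [x[0] - 1.0, 10.0 * (x[1] + 2.0)]

x_final = adagrad(grad_C, x0=[0.0, 0.0], eta=0.5, n_steps=2000)
print("AdaGrad estimate:", x_final, " expected minimizer: [1.0, -2.0]")
```

Each coordinate gets its own effective step size, so the steep direction (the second coordinate here) is automatically taken with smaller steps than the flat one.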






4.6 RMSProp 

The Root Mean Square Propagation, or RMSProp, is a variant of the gradient
descent method with adaptive learning rate, which is obtained if the gradient
is divided by a running average of its magnitude, see Tieleman and Hinton
[118], 2012.

If $C(x)$ denotes the cost function, and $g_t = \nabla C(x(t))$ is its
gradient evaluated at time step $t$, then the running average is defined
recursively by

$$v(t) = \gamma v(t-1) + (1 - \gamma)\,g_{t-1}^2,$$

where $\gamma \in (0, 1)$ is the forgetting factor, which controls the
exponential decay rate, and the vector $g_{t-1}^2$ denotes the elementwise
square of the gradient $g_{t-1}$.⁶ It can be shown inductively that the
exponential moving average of the squared gradient $v(t)$ is given by

$$v(t) = \gamma^tv(0) + (1 - \gamma)\sum_{i=1}^t \gamma^{t-i}g_{i-1}^2.$$

Since the coefficients sum up to 1, namely

$$\gamma^t + (1 - \gamma)\sum_{i=1}^t \gamma^{t-i} = 1,$$


6 For discrete time steps this can be written equivalently as
$v^n = \gamma v^{n-1} + (1 - \gamma)(g^{n-1})^2$.







it follows that $v(t)$ is a weighted mean of $v(0)$ (which is usually taken
equal to zero) and all the squared gradients until step $t$, giving larger
weights to more recent gradients.

The minimum of the cost function, $x^* = \arg\min_x C(x)$, is obtained by the
approximation sequence $(x(t))_{t\ge1}$ defined recursively by the updating
rule

$$x(t+1) = x(t) - \eta\,\frac{g_t}{\sqrt{|v(t)|}}, \tag{4.6.32}$$

where $\eta > 0$ is a learning rate. We note that $v(t)$ can be interpreted as
an estimation of the second moment (uncentered variance) of the gradient. The
equation (4.6.32) can be seen as a gradient descent variant where the gradient
$g_t$ is scaled by its standard deviation estimation, $\sqrt{|v(t)|}$.
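A possible implementation of the RMSProp update is sketched below. It makes two assumptions that differ slightly from the formulas above and should be read as such: the running average is refreshed with the current gradient before each step (so that the very first step does not divide by $v(0) = 0$), and a small constant `tiny` is added to the denominator. The function name, test cost, and hyperparameter values are illustrative.

```python
import math

# RMSProp sketch: v <- gamma*v + (1 - gamma)*g^2, then x <- x - eta*g/sqrt(v).
# Assumption: v is updated with the current gradient before the step.

def rmsprop(grad_C, x0, eta, gamma, n_steps, tiny=1e-8):
    x = list(x0)
    v = [0.0] * len(x)                          # v(0) = 0
    for _ in range(n_steps):
        g = grad_C(x)
        v = [gamma * vj + (1.0 - gamma) * gj * gj for vj, gj in zip(v, g)]
        x = [xj - eta * gj / (math.sqrt(vj) + tiny)
             for xj, gj, vj in zip(x, g, v)]
    return x

def grad_C(x):
    # gradient of the assumed cost C(x) = 0.5*(x_0 - 1)^2 + 5*(x_1 + 2)^2
    return [x[0] - 1.0, 10.0 * (x[1] + 2.0)]

x_final = rmsprop(grad_C, x0=[0.0, 0.0], eta=0.01, gamma=0.9, n_steps=2000)
print("RMSProp estimate:", x_final, " expected minimizer: [1.0, -2.0]")
```

With a constant learning rate the step length does not shrink with the gradient, so in this sketch the iterates end up hovering within a distance of order $\eta$ of the minimizer instead of converging to it exactly.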


Example 4.6.1 We shall track mathematically the minimum of the real-valued
function $C(x) = \frac{1}{2}x^2$ using the RMSProp method. The gradient in
this case is $g_t = x(t)$ and the moving average is given by

$$v(t) = (1 - \gamma)\sum_{i=1}^t \gamma^{t-i}x(i)^2,$$

where we considered $v(0) = 0$. The sequence that estimates the minimum
$x^* = \arg\min_x C(x)$ is given recursively by

$$x(t+1) = x(t) - \eta\,\frac{x(t)}{\sqrt{|v(t)|}}
= x(t)\Big(1 - \frac{\eta}{\sqrt{|v(t)|}}\Big).$$

We denote $\rho_t = 1 - \eta/\sqrt{|v(t)|}$ and assume that
$0 < \rho_t < \rho < 1$. Then the relation $x(t+1) = x(t)\rho_t$ together with
the initial condition $x(0) > 0$ implies that the sequence $x(t)$ is
decreasing and bounded from below by 0, satisfying the inequality

$$0 < x(t) < x(0)\rho^t.$$

Taking the limit and using the Squeeze Theorem yields
$x^* = \lim_{t\to\infty} x(t) = 0$, namely the minimum of
$C(x) = \frac{1}{2}x^2$ is reached for $x = 0$.

Now, we go back and show the double inequality $0 < \rho_t < \rho < 1$. The
first inequality, $0 < \rho_t$, is equivalent to $\frac{\eta^2}{v(t)} < 1$. In
order to show this inequality, we shall assume that $x(t)$ does not converge
to 0 (because otherwise we already arrived at the conclusion), namely there is
$\epsilon > 0$ such that $x(t)^2 > \epsilon$, for all $t \ge 1$. This implies
that there is $N \ge 1$ such that $v(t) > \frac{\epsilon}{2}$, for all
$t > N$. This fact follows from the computation

$$v(t) = (1 - \gamma)\sum_{i=1}^t \gamma^{t-i}x(i)^2
> \epsilon(1 - \gamma)\sum_{i=1}^t \gamma^{t-i}
= \epsilon(1 - \gamma)\,\frac{1 - \gamma^t}{1 - \gamma}
= \epsilon(1 - \gamma^t) > \frac{\epsilon}{2},$$






















where we choose $N$ such that $\gamma^t < \frac{1}{2}$ for $t > N$. The
previous inequality implies

$$\frac{\eta^2}{v(t)} < \frac{2\eta^2}{\epsilon} < 1, \quad \forall t > N,$$

i.e., $0 < \rho_t$, where we chose the learning rate satisfying
$\eta < \sqrt{\epsilon/2}$.

The second inequality, $\rho_t < \rho < 1$, is equivalent to
$v(t) < \frac{\eta^2}{(1 - \rho)^2}$, i.e., to the boundedness of the sequence
$v(t)$. A sufficient condition for this is the boundedness of the
approximation sequence, $|x(t)| < M$, for all $t$. This can be shown as
follows:

$$v(t) = (1 - \gamma)\sum_{i=1}^t \gamma^{t-i}x(i)^2
< M^2(1 - \gamma)\sum_{i=1}^t \gamma^{t-i}
= M^2(1 - \gamma^t) < M^2.$$


4.7 Adam 

This adaptive learning method was inspired by the previous AdaGrad and
RMSProp methods and was introduced in 2014 by Kingma and Ba [31]. The
method uses the estimation of the first and second moments of the gradient
by exponential moving averages and then applies some bias corrections.

The cost function, $C(x)$, subject to minimization, may have some
stochasticity built in and is assumed to be differentiable with respect to
$x$. We are interested in the minimum $x^* = \arg\min_x \mathbb{E}[C(x)]$.

For this, we denote the gradient of the cost function at the time step $t$ by
$g_t = \nabla C(x(t))$. We consider two exponential decay rates for the moment
estimates, $\beta_1, \beta_2 \in [0, 1)$, fix an initial vector
$x(0) = x_0$, initialize the moments by $m(0) = 0$ and $v(0) = 0$, and
consider the moments updates

$$m(t) = \beta_1m(t-1) + (1 - \beta_1)g_t$$
$$v(t) = \beta_2v(t-1) + (1 - \beta_2)(g_t)^2,$$

where $(g_t)^2$ denotes the elementwise square of the vector $g_t$. The
moments $m(t)$ and $v(t)$ can be interpreted as biased estimates of the first
moment (mean) and second moment (uncentered variance) of the gradient given by
exponential moving averages. The bias can be corrected as follows. We can
write inductively

$$m(t) = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i}g_i$$
$$v(t) = (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}(g_i)^2.$$









Applying the expectation operator and assuming that the first and second
moments are stationary, we obtain

$$\mathbb{E}[m(t)] = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i}\,\mathbb{E}[g_i]
= (1 - \beta_1^t)\,\mathbb{E}[g_t]$$
$$\mathbb{E}[v(t)] = (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\,\mathbb{E}[(g_i)^2]
= (1 - \beta_2^t)\,\mathbb{E}[(g_t)^2].$$

Therefore, the bias-corrected moments are

$$\hat{m}(t) = m(t)/(1 - \beta_1^t)$$
$$\hat{v}(t) = v(t)/(1 - \beta_2^t).$$

The final recursive formula is given by

$$x(t+1) = x(t) - \eta\,\frac{\hat{m}(t)}{\sqrt{\hat{v}(t)} + \epsilon},$$

with $\epsilon > 0$ a small scalar used to prevent division by zero. Some
default settings for the aforementioned hyperparameters used in [31] are
$\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.
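The Adam recursion can be sketched in a few lines. The code below is an illustrative implementation, not a reference one: the function name `adam`, the test cost, and the argument layout are assumptions, and the default hyperparameters are the ones given in Kingma and Ba's paper ($\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).

```python
import math

# Adam sketch: moving averages m, v, bias correction, then the update.

def adam(grad_C, x0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1):
    x = list(x0)
    m = [0.0] * len(x)                          # m(0) = 0
    v = [0.0] * len(x)                          # v(0) = 0
    for t in range(1, n_steps + 1):
        g = grad_C(x)
        m = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, g)]
        v = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, g)]
        m_hat = [mj / (1.0 - beta1 ** t) for mj in m]   # bias-corrected mean
        v_hat = [vj / (1.0 - beta2 ** t) for vj in v]   # bias-corrected 2nd moment
        x = [xj - eta * mh / (math.sqrt(vh) + eps)
             for xj, mh, vh in zip(x, m_hat, v_hat)]
    return x

def cost(x):
    # assumed test cost C(x) = 0.5*(x_0 - 1)^2 + 5*(x_1 + 2)^2
    return 0.5 * (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2

def grad_C(x):
    return [x[0] - 1.0, 10.0 * (x[1] + 2.0)]

x_start = [0.0, 0.0]
x_one_step = adam(grad_C, x_start, n_steps=1)
x_many = adam(grad_C, x_start, eta=0.01, n_steps=3000)
print("after one step  :", x_one_step)
print("after 3000 steps:", x_many, " cost:", cost(x_many))
```

Thanks to the bias correction, the first step has length close to $\eta$ in every coordinate regardless of the gradient scale: here the gradient at the start is $(-1, 20)$ and the first update is approximately $(+\eta, -\eta)$.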


4.8 AdaMax 


Adaptive Maximum method, or AdaMax, is a variant of Adam based on the
infinity norm, [31]. The model is defined by the following set of iterations:

$$m(t) = \beta_1m(t-1) + (1 - \beta_1)g_t$$
$$u(t) = \max\big(\beta_2u(t-1), |g_t|\big)$$
$$x(t) = x(t-1) - \frac{\eta}{1 - \beta_1^t}\,\frac{m(t)}{u(t)},$$

with moment initializations $m(0) = u(0) = 0$. We note that in this case there
is no correction needed for the bias of $u(t)$.
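A sketch of the AdaMax iteration is given below, under stated assumptions: the function name `adamax`, the test cost, and the hyperparameter values are illustrative, and the guard that skips the update of a coordinate while $u(t) = 0$ is an addition made here to avoid a division by zero.

```python
# AdaMax sketch: the second-moment estimate of Adam is replaced by a
# decayed running maximum u(t) of the absolute gradient.

def adamax(grad_C, x0, eta=0.002, beta1=0.9, beta2=0.999, n_steps=1):
    x = list(x0)
    m = [0.0] * len(x)                          # m(0) = 0
    u = [0.0] * len(x)                          # u(0) = 0
    for t in range(1, n_steps + 1):
        g = grad_C(x)
        m = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, g)]
        u = [max(beta2 * uj, abs(gj)) for uj, gj in zip(u, g)]
        scale = eta / (1.0 - beta1 ** t)        # bias correction for m only
        # assumed guard: leave a coordinate unchanged while u_j is still zero
        x = [xj - scale * mj / uj if uj > 0.0 else xj
             for xj, mj, uj in zip(x, m, u)]
    return x

def cost(x):
    # assumed test cost C(x) = 0.5*(x_0 - 1)^2 + 5*(x_1 + 2)^2
    return 0.5 * (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2

def grad_C(x):
    return [x[0] - 1.0, 10.0 * (x[1] + 2.0)]

x_start = [0.0, 0.0]
x_one_step = adamax(grad_C, x_start, n_steps=1)
x_many = adamax(grad_C, x_start, eta=0.01, n_steps=3000)
print("after one step  :", x_one_step)
print("after 3000 steps:", x_many, " cost:", cost(x_many))
```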


4.9 Simulated Annealing Method 

Another method for finding global minima of a given function is the
simulated annealing (SA) introduced to neural networks in 1983 by Kirkpatrick
et al. [63]. The method is inspired by a metallurgical process called
annealing. During this process, the metal is tempered, i.e., is overheated and
after that is slowly cooled. By this procedure the crystalline structure of
the metal reaches a global minimum energy, the method being used to eliminate
possible defects in the metal, making it stronger.

Figure 4.14: Potential function, phase space representation and final
distribution of particles.

Since the kinetic energy of a molecule is proportional to the absolute 
temperature, see Einstein [38], then the metal molecules get excited at high 
temperatures, and then, during the cooling process, they lose energy and will
finally end up at positions of global minimum energy in the crystal lattice. 
We shall discuss this behavior in the following from both the kinematics and 
thermodynamics points of view. 

4.9.1 Kinematic Approach for SA 

Consider the energy potential z = f(x), which has some local minima and also 
a global minimum. Assume that the metal molecules have a dynamics induced 
by this potential. For a fixed temperature, $T$, each particle has constant total
energy. Consequently, it has to move along constant energy level curves in 
the phase space, see section 4.3. When the temperature increases, the kinetic 
energy of particles increases too, and hence they will translate up to higher 
energy levels. 

When the temperature is slowly decreased, the kinetic energy of particles
decreases too, the effect on the particles' movement being similar to the
effect of a damping factor. This brings the particle to a lower level of
energy, hopefully even lower than the initial one. In the end, most particles
will be situated at the global minimum state in the phase space.

Figure 4.15: Most of the damped trajectories will spiral in approaching the
equilibrium point $(x_C^*, 0)$, which has a larger basin of attraction.

We shall discuss this behavior using Fig. 4.14. Consider the potential
function $z = f(x)$, having a local minimum at the point $A$ and a global
minimum at $C$. Assume that a particle is in the proximity of the local
minimum $A$ at the initial temperature $T_0$. Then its associated trajectory
in the phase space will be a small closed loop around the equilibrium point
$(x_A^*, 0)$. We say that the particle got stuck in a local minimum, and
without any extra energy the particle cannot escape from the basin of
attraction of point $A$.

Now, we increase the temperature to $T_{max}$ such that the particle has a
kinetic energy that allows it to go as high as point $D$. Then, at this stage,
the associated trajectory in the phase space corresponds to the energy level
$E_D$, which is the largest loop in the figure. Then we start decreasing the
temperature slowly. During this process the particle has a damped movement in
the phase space, and the energy level $E_D$ will get down to the energy level
$E_B$, which has a figure-eight shape. Hence, after lowering the temperature
even more, the trajectory in the phase space will revolve around the global
minimum $(x_C^*, 0)$. When the temperature gets down to zero, the damping will
bring the trajectory toward the global minimum $(x_C^*, 0)$, see Fig. 4.15.

If we consider now a large number of particles to which we apply the
previous reasoning, then most of them will end up in a proximity of the
equilibrium point $(x_C^*, 0)$, and only a few will not be able to escape from
the basin of attraction of point $A$. This explains the double-peaked
distribution of particles, $p(x)$, in the lower part of Fig. 4.14. The slower
the cooling process, the smaller the peak above $x_A^*$. If the cooling were
infinitely long, the distribution would be just single peaked, which
corresponds to the case when all particles have a minimum potential energy. It
is worth noting that when $T = T_{max}$ the distribution of particles is
uniform.


4.9.2 Thermodynamic Interpretation for SA 

We assume now that all particles form a thermodynamic system whose internal
energy has to be minimized. The following concept of Boltzmann probability
distribution will be useful. It is known that a system in thermal equilibrium
at a temperature T has a distribution of energy given by

$$p(E) = e^{-\frac{E}{kT}},$$

where E is the energy level, T is the temperature, and k denotes the
Boltzmann constant (which for the sake of simplicity we shall take from now
on equal to 1), see Fig. 4.16 a. In fact, in order to make the previous
formula mathematically valid as a probability density, we need to specify an
interval, $0 < E < E_{\max}$, and a normalization constant in front of the
exponential. For our purposes we do not need to worry about it for the time
being. The previous distribution is used as follows. If a particle is in an
initial state, $s_0$, with energy $E_0$, the probability to jump to a state
$s_1$ of a higher energy, $E_1$, is

$$p(\Delta E) = e^{-\frac{\Delta E}{T}},$$

where $\Delta E = E_1 - E_0$ is the difference in the energies of the states.
This model has a few consequences:

(i) Consider the temperature T fixed. Then the higher the energy jump,
$\Delta E$, the lower the probability of the jump, $p(\Delta E)$.

(ii) For a given energy jump $\Delta E$, the higher the temperature, the
larger the probability of the jump. Consequently, at low temperatures, the
probability for a given jump size in energy is lower than the probability for
the same jump size at a higher temperature.

(iii) For very large T, the jump energies are distributed uniformly.

These can be stated equivalently by saying that a thermodynamic system
is more stable at low energies and low temperatures. These observations play
a central role in the following global minimum search algorithm.

Figure 4.16: a. The Boltzmann distribution $p(x) = e^{-x/T}$. b. The reflected
sigmoid function $f(x) = \frac{1}{1 + e^{x/T}}$. Both functions are represented
for a few different values of T.

Consider the function $E = f(x)$, which needs to be minimized globally.
Here, x plays the role of a state of the system, while E denotes its internal
energy. We start from a point $x^0$, which is the initial state of the system,
and set the temperature parameter T very high. A second point $x^1$ is
created randomly in a neighborhood of $x^0$. Then we compute the energy
jump between the states $x^0$ and $x^1$ as $\Delta E = f(x^1) - f(x^0)$. There
are two cases that can occur:

1) If $\Delta E < 0$, which means $f(x^1) < f(x^0)$, then we accept the new
state $x^1$ with probability 1, and we continue the next steps of the search
from $x^1$.

2) If $\Delta E > 0$, then we do not discard the state $x^1$ right away (as a
greedy algorithm would do). We accept $x^1$ with probability $p(\Delta E)$,
where p denotes the Boltzmann distribution.

We note that the acceptance probability used in the above two steps can
be written in a single formula as

$$P(\Delta E; T) = \min\{1, e^{-\Delta E / T}\}.$$

In the next step we reduce the temperature parameter and choose ran- 
domly a new point in the neighborhood of the last accepted point; then 
continue with the above steps. 

By this procedure, when T is high, the probability to accept states that
increase the energy is large. In fact, for $T \to \infty$, any point can be
accepted with probability 1. Setting T high in the beginning helps the system
get out of a possible local minimum.

When T gets closer to 0, there is a small probability to accept points
that increase the energy and a much higher probability to choose points that
decrease the energy.

We conclude next with the following pseudocode for the simulated annealing: 

1. Fix $\epsilon > 0$. Choose $x^0$ and set T high.

2. Choose a random point $x^1$ in a neighborhood of $x^0$.

3. Compute $\Delta E = f(x^1) - f(x^0)$.

• if $\Delta E < 0$, set $x^2 = x^1$.

• if $\Delta E > 0$, choose a random number $r \in (0,1)$.

  ◦ if $r \le e^{-\Delta E / T}$, set $x^2 = x^1$.

  ◦ if $r > e^{-\Delta E / T}$, then go to step 2.

4. If $\|x^{n+1} - x^n\| < \epsilon$ then stop; else reduce T and go to step 2.

We note that decreasing the temperature too fast leads to premature
convergence, which might not reach the global minimum.⁷ Decreasing the
temperature too slowly leads to a slow convergence of the algorithm. In this
case the search algorithm constructs a sequence $(x^n)_n$ which covers the
state space well, so that a global minimum cannot be missed. It can be proved
that using the simulated annealing method the global minimum of the energy
function is approached asymptotically, see [1].

One way to set the initial temperature is to choose it equal to the average
of the function f, as
$T_{\max} = \mathrm{Ave}(f) = \frac{1}{\mathrm{vol}(D)} \int_D f(x)\,dx$,
where D denotes the state space.
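
The acceptance rule and the cooling loop can be sketched in a few lines of
Python. This is only an illustrative sketch under stated assumptions, not the
book's implementation: the uniform proposal of width `step`, the geometric
cooling factor `cooling`, the stopping rule based on a minimal temperature
`t_min` (used here instead of the test $\|x^{n+1} - x^n\| < \epsilon$), the
bookkeeping of the best visited point, and the test function `double_well`
are all hypothetical choices made for the example.

```python
import math
import random

def simulated_annealing(f, x0, t_max, step=1.0, cooling=0.99, t_min=1e-4, seed=0):
    """Minimize a one-dimensional function f by simulated annealing."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx            # assumption: remember the best point seen
    t = t_max
    while t > t_min:                  # assumption: stop when T is small enough
        # Random candidate in a neighborhood of the current point.
        cand = x + rng.uniform(-step, step)
        f_cand = f(cand)
        delta_e = f_cand - fx
        # Accept with probability min(1, exp(-delta_e / T)).
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / t):
            x, fx = cand, f_cand
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling                  # assumption: geometric cooling schedule
    return best_x, best_f

def double_well(x):
    """Hypothetical test function with one local and one global minimum."""
    return x**4 - 4 * x**2 + x

best_x, best_f = simulated_annealing(double_well, x0=2.0, t_max=10.0)
```

A slower schedule (a value of `cooling` closer to 1) spends more iterations at
each temperature level, in line with the trade-off described above.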


Remark 4.9.1 (i) The Boltzmann distribution was chosen for convenience and
for the sake of similarity with thermodynamics. There are other acceptance
probabilities which work as well as the previous one, for instance, the
reflected sigmoid function $f(x) = \frac{1}{1 + e^{x/T}}$, see [101]. When
$T \to \infty$, the function tends to the horizontal line $y = 0.5$ and its
behavior is similar to Boltzmann’s distribution, see Fig. 4.16 b. This
distribution is of paramount importance in the study of Boltzmann machines in
Chapter 20.

(ii) A situation similar to the simulated annealing method occurs in the case
of a roulette wheel. When the ball is vigorously spun along the wheel, it
does not land in any of the numbered slots immediately. These slots act as
local minima for the potential energy function, and as long as the ball has
enough energy, it can easily get in and out of these pockets. It is only after
the ball’s energy decreases sufficiently, due to friction, that it lands in one
of the pockets, and this occurs with equal probability for all numbers. If we
assume that one of the slots is deeper than the others, the ball will have the
tendency of landing in this specific pocket with a higher probability.


4.10 Increasing Resolution Method

In this section we shall present a non-stochastic method for finding global 
minima of functions that have plenty of local minima. This approach can 
also be performed in several variables, but for the sake of simplicity we shall 
present it here just for the case of a one-dimensional signal, $y = f(x)$,
depending on a real variable, x. The method works for signals which are
integrable and not necessarily differentiable. This technique can be
considered as a version of the simulated annealing method, as we shall see
shortly.

⁷ From the physics point of view, in this case the substance does not reach
the state of a crystalline structure, but rather an amorphous structure.

The idea is to blur the signal $y = f(x)$ using a Gaussian filter,
$G_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$,
obtaining the signal $f_\sigma(x)$, which is given by the convolution

$$f_\sigma(x) = (f * G_\sigma)(x) = \int_{\mathbb{R}} f(u)\, G_\sigma(x - u)\, du = \int_{\mathbb{R}} f(x - u)\, G_\sigma(u)\, du.$$

The positive parameter $\sigma$ controls the signal resolution. A large value
of $\sigma$ provides a signal $f_\sigma(x)$, which retains a rough shape of
$f(x)$. On the other hand, a small $\sigma$ retains details. Actually, using
the fact that the Gaussian tends to the Dirac measure,
$\lim_{\sigma \searrow 0} G_\sigma(x) = \delta(x)$, we have

$$\lim_{\sigma \searrow 0} f_\sigma(x) = \lim_{\sigma \searrow 0} (f * G_\sigma)(x) = (f * \delta)(x) = f(x).$$

This means that for small values of $\sigma$ the filtered signal is very close
to the initial signal, a fact which agrees with common sense.

If one tries to find the global minimum of $y = f(x)$ directly, using the
gradient descent method, two things would stand in the way:

1. The function f might not be differentiable and the algorithm can’t be
applied. For instance, neural nets of perceptrons (multilayer perceptrons)
belong to this category;

2. The function f is differentiable, the algorithm applies, but it gets stuck
in a local minimum. One deficient way to fix this issue is to increase the
learning rate to allow skipping over local minima. However, this will lead to
an oscillation around the minimum, which does not give a precise
approximation.

The next method avoids the previous two inconveniences. We first note
that the smoothed function $f_\sigma(x)$ is integrable and differentiable with
respect to x (even if f might not be), satisfying

$$f_\sigma'(x) = (f * G_\sigma')(x), \qquad \|f_\sigma\|_1 \le \|f\|_1.$$

The first identity follows from a change of variables and the
interchangeability of the derivative with the integral, due to the exponential
decay of the Gaussian. For the second inequality, see Exercise 4.17.5.

The algorithm. Since the signal resolution increases as $\sigma$ decreases, we
shall first find an approximate location of the global minimum, and then,
increasing the resolution, we look for more accurate approximations of the
global minimum. In this way, the search is confined to smaller and smaller
neighborhoods of the global minimum.








Figure 4.17: The increase in the signal resolution. The minimum on each 
resolution profile is represented by a little circle. 


Consider the resolution schedule $\sigma_0 > \sigma_1 > \cdots > \sigma_k > 0$.
Then $f_{\sigma_0}, \dots, f_{\sigma_k}$ are different resolution levels of the
signal $y = f(x)$, see Fig. 4.17. We shall start with a large value of
$\sigma_0$. Then the signal $y = f_{\sigma_0}(x)$ looks like a rough shape of
$y = f(x)$, and it does not have any local minima if $\sigma_0$ is large
enough. Let

$$x_0 = \arg\min_{x} f_{\sigma_0}(x).$$

Since $f_{\sigma_0}$ is smooth and $x_0$ is the only minimum, it can be found
using the gradient descent method.

We consider the next resolution signal $y = f_{\sigma_1}(x)$ and apply the
gradient descent method, starting from $x_0$, to obtain the lowest value in a
neighborhood $V_0$ of $x_0$ at

$$x_1 = \arg\min_{x \in V_0} f_{\sigma_1}(x).$$

Next, we apply the gradient descent on a neighborhood $V_1$ of $x_1$ for the
signal $f_{\sigma_2}$ and obtain

$$x_2 = \arg\min_{x \in V_1} f_{\sigma_2}(x).$$

We continue the procedure until we reach

$$x_k = \arg\min_{x \in V_{k-1}} f_{\sigma_k}(x).$$

Due to the increase in resolution we have the descending sequence of
neighborhoods $V_0 \supset \cdots \supset V_{k-1}$. Each one contains an
approximation of the global minimum. If the schedule is correctly chosen, then
$x_k$ is a good approximation of the global minimum of $f(x)$.
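
The coarse-to-fine procedure can be illustrated numerically. The following is
a minimal sketch under several assumptions that are not part of the text: the
signal is sampled on a uniform grid, the convolution is approximated by a
truncated discrete Gaussian kernel, the minimum at the coarsest level is taken
as a plain `argmin` (reasonable only when $\sigma_0$ is large enough to leave
a single minimum), and gradient descent is replaced by a discrete downhill
walk on the grid. The function names and the test signal are hypothetical.

```python
import numpy as np

def gaussian_blur(y, dx, sigma):
    """Discrete approximation of (f * G_sigma) on a uniform grid of spacing dx."""
    half = int(np.ceil(4 * sigma / dx))        # truncate the kernel at 4 sigma
    u = np.arange(-half, half + 1) * dx
    kernel = np.exp(-u**2 / (2 * sigma**2))
    kernel /= kernel.sum()                     # normalize: constants are preserved
    padded = np.pad(y, half, mode="edge")      # extend the signal at the boundary
    return np.convolve(padded, kernel, mode="valid")

def local_descent(y, i):
    """Walk downhill on the grid from index i until no neighbor is lower."""
    while True:
        lo, hi = max(i - 1, 0), min(i + 1, len(y) - 1)
        j = min((i, lo, hi), key=lambda k: y[k])   # ties keep the current index
        if j == i:
            return i
        i = j

def increasing_resolution_min(f, a, b, sigmas, n=2001):
    """Follow the minimum through the schedule sigmas[0] > sigmas[1] > ... > 0."""
    x = np.linspace(a, b, n)
    y = f(x)
    dx = x[1] - x[0]
    i = int(np.argmin(gaussian_blur(y, dx, sigmas[0])))   # coarsest level
    for s in sigmas[1:]:
        i = local_descent(gaussian_blur(y, dx, s), i)     # refine near last minimum
    return float(x[local_descent(y, i)])                  # finish on the raw signal

# Hypothetical signal with many local minima.
signal = lambda x: x**2 + 0.5 * np.sin(20 * x)
x_star = increasing_resolution_min(signal, -3.0, 3.0, [0.5, 0.1, 0.02])
```

Each level starts its local search from the minimizer of the previous, blurrier
level, which mirrors the nested neighborhoods $V_0 \supset \cdots \supset V_{k-1}$.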


Remark 4.10.1 1. The parameter $\sigma$ plays a role similar to the temperature
in the simulated annealing method.

2. This method is similar to the process of finding a small store located next
to a certain street intersection in a given city, using Google Maps. First, we
map the country in a chart $V_0$; then we increase the resolution to map the
city in a chart $V_1$; increasing the resolution further, we look for the
streets in a smaller chart $V_2$; magnifying the image even further, we look
for the store at the prescribed streets’ intersection in a chart $V_3$. We
obtain the obvious inclusion $V_3 \subset V_2 \subset V_1 \subset V_0$, which
corresponds to the resolution schedule store, streets, city, country. Skipping
even one resolution level makes the search almost impossible.

Blurring schedules. If the signal $y = f(x)$ is blurred using the Gaussian
filter $G_{\sigma_1}(x)$, we obtain the signal $f_{\sigma_1} = f * G_{\sigma_1}$.
Next, we blur $f_{\sigma_1}$ using the Gaussian filter $G_{\sigma_2}(x)$ and
obtain $f_{\sigma_1, \sigma_2} = f_{\sigma_1} * G_{\sigma_2}$. We can also blur
f directly using the Gaussian $G_{\sigma_1 + \sigma_2}$ and obtain the signal
$f_{\sigma_1 + \sigma_2} = f * G_{\sigma_1 + \sigma_2}$.

We shall show in the following that the signal $f_{\sigma_1, \sigma_2}$ has
more resolution than the signal $f_{\sigma_1 + \sigma_2}$. Using the
associativity of convolution as well as Exercise 4.17.6, we obtain

$$f_{\sigma_1, \sigma_2} = f_{\sigma_1} * G_{\sigma_2} = (f * G_{\sigma_1}) * G_{\sigma_2} = f * (G_{\sigma_1} * G_{\sigma_2}) = f * G_\sigma = f_\sigma,$$

where $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$. Since
$\sigma < \sigma_1 + \sigma_2$, it follows that $f_\sigma$ contains more
details than $f_{\sigma_1 + \sigma_2}$. Similarly, if we blur the signal n
times, we obtain

$$f * G_{\sigma_1} * \cdots * G_{\sigma_n} = f * G_\sigma,$$

with $\sigma = \sqrt{\sigma_1^2 + \cdots + \sigma_n^2} < \sigma_1 + \cdots + \sigma_n$.
If we consider the constraint

$$\sigma_1 + \cdots + \sigma_n = s,$$

which means that the blurring schedule has given length, s, then by Exercise
4.17.7, the minimum of $\sigma$ is reached for the case

$$\sigma_1 = \cdots = \sigma_n = \frac{s}{n}.$$

Therefore, blurring the signal in smaller steps preserves more details of the
initial signal, while blurring it in larger steps loses information.
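
The composition rule $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$ can be checked
numerically. The sketch below is only an illustration: the grid, the values
0.6 and 0.8, and the helper names `gaussian` and `effective_sigma` are
assumptions chosen for the example, and the continuous convolution is
approximated by a Riemann sum.

```python
import numpy as np

def gaussian(x, sigma):
    """Gaussian filter G_sigma evaluated at the points x."""
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def effective_sigma(schedule):
    """Resolution parameter of f * G_{s1} * ... * G_{sn}: sqrt(s1^2 + ... + sn^2)."""
    return float(np.sqrt(np.sum(np.square(schedule))))

x = np.linspace(-10, 10, 2001)        # symmetric grid with spacing dx
dx = x[1] - x[0]

# Blur in two steps: G_0.6 * G_0.8, approximated by a discrete convolution.
two_step = np.convolve(gaussian(x, 0.6), gaussian(x, 0.8), mode="same") * dx
# Blur in one step with sigma = sqrt(0.6^2 + 0.8^2) = 1.0.
one_step = gaussian(x, effective_sigma([0.6, 0.8]))
max_err = float(np.max(np.abs(two_step - one_step)))
```

For a fixed length $s = \sigma_1 + \sigma_2 = 1$, the call
`effective_sigma([0.5, 0.5])` returns a smaller value than
`effective_sigma([0.9, 0.1])`, in agreement with the equal-steps conclusion.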

It is worth noting the existence of infinite length schedules that provide
a signal with finite resolution. For instance, the blurring schedule

$$\sigma_k = \frac{1}{k}$$

satisfies $s = \sum_{j \ge 1} \sigma_j = \infty$, while
$\sigma^2 = \sum_{j \ge 1} \sigma_j^2 = \sum_{j \ge 1} \frac{1}{j^2} = \frac{\pi^2}{6}$.







4.11 Hessian Method 


The gradient methods presented in the previous sections do not take into
account the curvature of the surface $z = f(x)$, which is described by the
matrix of second partial derivatives

$$H_f(x) = \left( \frac{\partial^2 f(x)}{\partial x_i \partial x_j} \right)_{i,j},$$

called Hessian. The eigenvalues of the matrix $H_f$ will be denoted by
$\{\lambda_i\}$, $1 \le i \le n$. The corresponding eigenvectors, $\{\xi_i\}$,
satisfy $H_f \xi_i = \lambda_i \xi_i$, with $\|\xi_i\| = 1$.

A few properties of the Hessian that will be used later are given in the
following.



Proposition 4.11.1 Let f be a real-valued function, twice continuously
differentiable. The following hold:

(a) The Hessian matrix is symmetric;

(b) For any x, the eigenvalues of $H_f(x)$ are real;

(c) Assume $H_f$ has distinct eigenvalues. Then for any nonzero vector v, we
have

$$\frac{\langle H_f(x) v, v \rangle}{\|v\|^2} = \sum_{i=1}^{n} w_i \lambda_i,$$

where $\{\lambda_i\}$ are the eigenvalues of $H_f(x)$ and $w_i$ are weights,
with $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$.

(d) Let $\lambda_{\min}$ and $\lambda_{\max}$ be the smallest and the largest
eigenvalue of $H_f(x)$, respectively. Then

$$\lambda_{\min} \|v\|^2 \le \langle H_f(x) v, v \rangle \le \lambda_{\max} \|v\|^2.$$

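Proof: (a) It is implied by the regularity of the function f and the
commutativity of derivatives

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}.$$

(b) It follows from the fact that any symmetric real matrix has real
eigenvalues.

(c) Expanding the vector v in the eigenvector orthonormal basis as
$v = \sum_{i=1}^{n} v^i \xi_i$ and using inner product bilinearity, we have

$$\langle Hv, v \rangle = \Big\langle H \sum_i v^i \xi_i, \sum_j v^j \xi_j \Big\rangle = \sum_i \sum_j v^i v^j \langle H \xi_i, \xi_j \rangle = \sum_i \sum_j v^i v^j \lambda_i \delta_{ij} = \sum_i (v^i)^2 \lambda_i.$$

Therefore,

$$\frac{\langle Hv, v \rangle}{\|v\|^2} = \sum_i w_i \lambda_i, \quad \text{with} \quad w_i = \frac{(v^i)^2}{\|v\|^2}.$$

(d) Using that

$$\lambda_{\min} \le \sum_i w_i \lambda_i \le \lambda_{\max},$$

multiplying by $\|v\|^2$ yields the desired result.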
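
Parts (c) and (d) are easy to verify numerically. The following sketch uses a
random symmetric matrix as a stand-in for a Hessian (an assumption made only
for illustration) together with NumPy's `numpy.linalg.eigh`, which returns the
eigenvalues in ascending order and an orthonormal set of eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
H = (B + B.T) / 2                  # symmetric matrix standing in for H_f(x)
lam, xi = np.linalg.eigh(H)        # eigenvalues (ascending), eigenvectors as columns

v = rng.standard_normal(4)         # an arbitrary nonzero vector
coords = xi.T @ v                  # components v^i of v in the eigenbasis
w = coords**2 / (v @ v)            # weights w_i = (v^i)^2 / ||v||^2
rayleigh = (H @ v) @ v / (v @ v)   # <Hv, v> / ||v||^2
weighted = w @ lam                 # sum_i w_i * lambda_i
```

The quantities `rayleigh` and `weighted` agree up to rounding, and both lie
between `lam[0]` and `lam[-1]`.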


Formula (4.2.4) provides the recurrence in the case of the gradient descent
method

$$x^{n+1} = x^n - \delta \nabla f(x^n). \qquad (4.11.33)$$

The linear approximation

$$f(x^{n+1}) \approx f(x^n) + (x^{n+1} - x^n)^T \nabla f(x^n) = f(x^n) - \delta \|\nabla f(x^n)\|^2 < f(x^n)$$

produces a value less than the previous value, $f(x^n)$. But this is just an
approximation, and if quadratic terms are taken into account, then
the previous inequality might not hold any more. We shall investigate this in
the following. The second-order Taylor approximation provides

$$f(x^{n+1}) \approx f(x^n) + (x^{n+1} - x^n)^T \nabla f(x^n) + \frac{1}{2} (x^{n+1} - x^n)^T H_f(x^n) (x^{n+1} - x^n)$$
$$= f(x^n) - \delta \|\nabla f(x^n)\|^2 + \frac{1}{2} \delta^2 \langle H_f(x^n) \nabla f(x^n), \nabla f(x^n) \rangle.$$

The last term on the right is the correction term due to the curvature of f.
There are two distinguished cases to look into:

1. The Hessian $H_f$ is negative definite (all eigenvalues are negative). In
this case the correction term is negative

$$\frac{1}{2} \delta^2 \langle H_f(x^n) \nabla f(x^n), \nabla f(x^n) \rangle < 0.$$

The previous Taylor approximation implies the inequality

$$f(x^{n+1}) < f(x^n) - \delta \|\nabla f(x^n)\|^2 < f(x^n),$$

i.e., for any learning rate $\delta > 0$ each iteration provides a lower value
for f than the previous one.

2. The Hessian $H_f$ is positive definite (all eigenvalues are positive). In
this case the correction term is positive

$$\frac{1}{2} \delta^2 \langle H_f(x^n) \nabla f(x^n), \nabla f(x^n) \rangle > 0.$$













Proposition 4.11.1, part (d), provides the following margins of error for the
correction term

$$0 < \frac{1}{2} \delta^2 \lambda_{\min} \|\nabla f(x^n)\|^2 \le \frac{1}{2} \delta^2 \langle H_f(x^n) \nabla f(x^n), \nabla f(x^n) \rangle \le \frac{1}{2} \delta^2 \lambda_{\max} \|\nabla f(x^n)\|^2.$$

The idea is that if the correction term is too large, then the inequality
$f(x^{n+1}) < f(x^n)$ might not hold. Considering the worst-case scenario, when
the term is maximum, the following inequality has to be satisfied:

$$f(x^n) - \delta \|\nabla f(x^n)\|^2 + \frac{1}{2} \delta^2 \lambda_{\max} \|\nabla f(x^n)\|^2 < f(x^n).$$

Dividing by $\delta \|\nabla f(x^n)\|^2 > 0$, this is equivalent to the
following condition on the learning rate

$$\delta < \frac{2}{\lambda_{\max}}. \qquad (4.11.34)$$

Hence, if the learning rate is bounded as in (4.11.34), then the quadratic
approximation provides for $f(x^{n+1})$ a value lower than $f(x^n)$. Formula
(4.11.33) provides a minimization sequence for f, given the aforementioned
constraint for the learning rate.
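
The bound (4.11.34) can be observed on a quadratic function, for which the
second-order Taylor formula is exact. The sketch below is illustrative only:
the matrix `A`, the starting point and the two step sizes are assumptions
chosen for the example.

```python
import numpy as np

A = np.diag([1.0, 10.0])       # Hessian of f(x) = 0.5 * x^T A x, lambda_max = 10

def f(x):
    return 0.5 * x @ A @ x

def gradient_descent_values(delta, x0, steps=20):
    """Return f(x^0), f(x^1), ... along x^{n+1} = x^n - delta * grad f(x^n)."""
    x = np.array(x0, dtype=float)
    values = [f(x)]
    for _ in range(steps):
        x = x - delta * (A @ x)            # grad f(x) = A x
        values.append(f(x))
    return values

lam_max = np.linalg.eigvalsh(A).max()
safe = gradient_descent_values(0.9 * 2 / lam_max, [1.0, 1.0])     # delta < 2/lambda_max
unsafe = gradient_descent_values(1.1 * 2 / lam_max, [1.0, 1.0])   # delta > 2/lambda_max
```

With the smaller step the values in `safe` decrease at every iteration, while
with the larger step the component along the eigenvector of $\lambda_{\max}$
grows and the values in `unsafe` increase.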


4.12 Newton’s Method 

Assume $f : \mathbb{R}^n \to \mathbb{R}$ is a function of class $C^2$, and let
$x^0 \in \mathbb{R}^n$ be an initial guess for the minimum of f. Assume f is
convex on a neighborhood of $x^0$. We approximate the function f about the
point $x^0$ by a quadratic function as

$$f(x) \approx F(x) = f(x^0) + (x - x^0)^T \nabla f(x^0) + \frac{1}{2} (x - x^0)^T H_f(x^0) (x - x^0),$$

where $H_f(x^0)$ is the Hessian matrix of f evaluated at $x^0$. Since f is
convex, its Hessian is positive semidefinite, and positive definite whenever
it has no zero eigenvalues. In that case the quadratic function F has a
minimum at the critical point $x^*$, which satisfies $\nabla F(x^*) = 0$.
Using that $\nabla F(x) = \nabla f(x^0) + H_f(x^0)(x - x^0)$, the equation
$\nabla F(x) = 0$ becomes

$$H_f(x^0)\, x = -\nabla f(x^0) + H_f(x^0)\, x^0.$$

Assuming the Hessian does not have any zero eigenvalues, it is invertible
and we obtain the following critical point:

$$x^* = x^0 - H_f^{-1}(x^0) \nabla f(x^0),$$

which is the minimum of $F(x)$.













Newton’s method consists of using this formula iteratively, constructing
the sequence $(x^n)_{n \ge 1}$ defined recursively by

$$x^{n+1} = x^n - H_f^{-1}(x^n) \nabla f(x^n). \qquad (4.12.35)$$

This method converges to the minimum of f faster than the gradient descent
method. However, despite its power, it has a major weakness, which
is the need to compute the inverse of the Hessian of f at each approximation.
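
The iteration (4.12.35) can be sketched as follows. This is a minimal
illustration, not a production optimizer: the stopping tolerance, the
iteration cap and the convex test function $f(x, y) = e^x - x + y^2$ are
assumptions made for the example. Instead of forming the inverse Hessian
explicitly, the sketch solves the linear system $H_f(x^n)\, p = \nabla f(x^n)$
with `numpy.linalg.solve`, which is the usual way to reduce the cost mentioned
above.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Iterate x^{n+1} = x^n - H_f(x^n)^{-1} grad f(x^n)."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        # Solve H_f(x) p = grad f(x) rather than inverting the Hessian.
        p = np.linalg.solve(hess(x), grad(x))
        x = x - p
        if np.linalg.norm(p) < tol:        # assumption: stop on a small step
            break
    return x

# Hypothetical convex test function f(x, y) = exp(x) - x + y^2, minimum at (0, 0).
grad = lambda z: np.array([np.exp(z[0]) - 1.0, 2.0 * z[1]])
hess = lambda z: np.array([[np.exp(z[0]), 0.0], [0.0, 2.0]])
x_min = newton_minimize(grad, hess, [1.0, 3.0])
```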


Remark 4.12.1 It is worth noting that in the one-dimensional case, n = 1, the
previous iterative formula becomes

$$x^{n+1} = x^n - \frac{f'(x^n)}{f''(x^n)}.$$

If we denote $h(x) = f'(x)$, then looking for the critical point of f is
equivalent to searching for the zero of $h(x)$. Convexity of f implies that h
is increasing, which guarantees the uniqueness of the zero. The previous
iteration formula is written as

$$x^{n+1} = x^n - \frac{h(x^n)}{h'(x^n)}.$$

We have now arrived at the familiar Newton-Raphson method for searching
a zero of h using successive tangent lines.


4.13 Stochastic Search 

The gradient descent method does not provide conclusive results in all cases.
This can be corrected by developing certain variants of the method, such as
the momentum method, which is used to avoid getting stuck in a local minimum
or to avoid overshooting the minimum. But sometimes there are other
reasons, such as getting lost in a plateau, from which it takes a long time to
get out, if ever. A plateau is a region where the gradient of the objective
function, $\nabla f$, is very small (or zero), corresponding to a relatively
flat region of the surface. This section presents a method of stochastic
flavor, which overcomes this major problem. The minimum is searched along a
diffusion process rather than along a deterministic path. We shall present it
in comparison with its deterministic counterpart, which we include first.

4.13.1 Deterministic variant 

If a vector field in $\mathbb{R}^n$ is given by

$$b(x) = \sum_{k=1}^{n} b^k(x) \frac{\partial}{\partial x_k},$$

then its integral curve, $x(t)$, passing through the point $x_0$ is the
solution of the differential equation

$$dx(t) = b(x(t))\, dt, \qquad x(0) = x_0.$$

The vector field $b(x(t)) = \dot{x}(t)$ represents the field of velocities
along the integral curves.

Consider the objective function $f : \mathbb{R}^n \to \mathbb{R}$, subject to
minimization. In order to achieve this, we look for a vector field $b(x)$ for
which the function $\varphi(t) = f(x(t))$ is decreasing. Equivalently, we look
for a flow $x(t)$ along which the objective function f decreases its value.

Using the derivative given by the chain rule

$$\varphi'(t) = \sum_{k=1}^{n} \frac{\partial f}{\partial x_k}\, \dot{x}_k(t) = \langle \nabla f, b \rangle \big|_{x(t)},$$

a linear approximation provides

$$\varphi(t + dt) = \varphi(t) + \varphi'(t)\, dt + o(dt^2).$$

Hence, to ensure that the difference
$\Delta\varphi(t) = \varphi(t + dt) - \varphi(t)$ is as negative as possible,
we choose the vector field $b(x)$, among the vector fields of a given length,
such that

$$b = \arg\min \langle \nabla f, b \rangle,$$

which is achieved for $b(x) = -\lambda \nabla f(x)$, for $\lambda > 0$, i.e.,
the vector field b is chosen to point in the opposite direction of the
gradient of f. This method is equivalent to the method of steepest descent. It
fails if $\nabla f$ is very small or equal to zero on an entire region. The
next variant takes care of this peculiarity.

4.13.2 Stochastic variant 

The idea of this approach is to replace the deterministic trajectory $x(t)$ by
a stochastic process $X_t$ as follows. In order to drive the iteration out
of the plateau, we shall superpose a random noise term on the deterministic
trajectory $x(t)$. This method uses the concepts of Brownian motion, Ito
diffusion, Dynkin’s formula, and infinitesimal generator operator, which the
reader can review in section D.8 of the Appendix.

After introducing the noise we obtain an Ito diffusion process $X_t$, starting
at $x_0$, given by

$$dX_t = b(X_t)\, dt + \sigma(X_t)\, dW_t, \qquad X_0 = x_0,$$

where $W_t = (W_1(t), \dots, W_m(t))$ is an m-dimensional Brownian motion. The
coefficients $b(x)^T = (b_1(x), \dots, b_n(x))$ and
$\sigma(x) = (\sigma_{ij}(x)) \in \mathbb{R}^n \times \mathbb{R}^m$ represent,
respectively, the drift and the dispersion of the process. They have to
be selected such that the conditional expectation function

$$\varphi(t) = E[f(X_t) \,|\, X_0 = x_0]$$

is decreasing as fast as possible. The value of the objective function along
the diffusion, $f(X_t)$, is a random variable dependent on t, and we would like
to minimize its expectation, given the starting point $X_0 = x_0$. This will be
accomplished using Dynkin’s formula. Before doing this, we need to recall the
expression of the infinitesimal operator, $\mathcal{A}$, associated with the
Ito process $X_t$, given by the second-order differential operator



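$$\mathcal{A} = \frac{1}{2} \sum_{i,j} (\sigma\sigma^T)_{ij} \frac{\partial^2}{\partial x_i \partial x_j} + \sum_{k} b_k \frac{\partial}{\partial x_k} = \frac{1}{2}\, \sigma\sigma^T \nabla^2 + \langle \nabla, b \rangle.$$

The matrix $\sigma\sigma^T$ is called diffusion matrix, and we shall get back
to its form shortly.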

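Applying Dynkin’s formula twice, we have

$$\varphi(t) = f(x_0) + \int_0^{t} E[\mathcal{A}f(X_s) \,|\, X_0 = x_0]\, ds$$
$$\varphi(t + dt) = f(x_0) + \int_0^{t + dt} E[\mathcal{A}f(X_s) \,|\, X_0 = x_0]\, ds.$$

Subtracting and using a linear approximation yields

$$\Delta\varphi(t) = \varphi(t + dt) - \varphi(t) = \int_t^{t + dt} E[\mathcal{A}f(X_s) \,|\, X_0 = x_0]\, ds = E[\mathcal{A}f(X_t) \,|\, X_0 = x_0]\, dt + o(dt^2).$$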

Following the same idea as in the method of steepest descent, we are
interested in the process $X_t$ for which the change $\Delta\varphi(t)$ is as
negative as possible. Consequently, we require that

$$E[\mathcal{A}f(X_t) \,|\, X_0 = x_0] < 0. \qquad (4.13.36)$$

To construct the process $X_t$ with this property it suffices to provide
$\sigma(x)$ and $b(x)$. In a plateau the objective function is flat, and hence
the Hessian is small. In order to get out of a plateau, we adopt the
requirement that the “diffusion is large” when the “Hessian is small”. This
will steer the process $X_t$ away from plateaus. Consequently, we choose the
diffusion matrix $\sigma\sigma^T$ proportional to the inverse of the Hessian of
the objective function,
$H_f = \left( \frac{\partial^2 f}{\partial x_i \partial x_j} \right)_{i,j}$, i.e.

$$\sigma\sigma^T = \lambda H_f^{-1}, \qquad (4.13.37)$$


with $\lambda > 0$ constant. In order to investigate the existence of solutions
$\sigma$ for equation (4.13.37), we assume the objective function satisfies the
following properties:

(i) f is of class $C^2$;

(ii) f is strictly convex.

Under these two conditions the Hessian $H_f$ is a real, symmetric, positive
definite, and nondegenerate matrix. Hence, it is invertible, with the inverse
also symmetric and positive definite. By the Cholesky decomposition, there is
a lower triangular matrix $\sigma$ satisfying (4.13.37). Asking for the strict
convexity of f is maybe too strong a requirement, but this condition
guarantees the existence of a dispersion matrix $\sigma$. Under this choice,
we can conveniently compute $\mathcal{A}f$ as

$$\mathcal{A}f(x) = \frac{1}{2} \sum_{i,j} (\sigma\sigma^T)_{ij} \frac{\partial^2 f}{\partial x_i \partial x_j} + \sum_{k} b_k(x) \frac{\partial f}{\partial x_k}$$
$$= \frac{\lambda}{2} \sum_{i,j} (H_f^{-1})_{ij} (H_f)_{ij} + \langle \nabla f(x), b(x) \rangle = \frac{n}{2} \lambda + \langle \nabla f(x), b(x) \rangle.$$


In order to minimize $\mathcal{A}f(x)$, the drift term b is chosen such that it
minimizes the second term, $\langle \nabla f(x), b(x) \rangle$, which comes
with the choice $b(x) = -\eta \nabla f(x)$, with $\eta > 0$ constant.
Consequently,

$$\mathcal{A}f(x) = \frac{n}{2} \lambda - \eta \|\nabla f(x)\|^2.$$

Requiring $\mathcal{A}f(x) < 0$ implies that the learning rates $\lambda$ and
$\eta$ satisfy the inequality

$$\frac{\lambda}{\eta} < \frac{2}{n} \|\nabla f(x)\|^2.$$

If $L = \inf_x \|\nabla f(x)\|^2 \ne 0$, then it suffices to choose $\lambda$
and $\eta$ such that $\lambda = \frac{2L}{n} \eta$.

Under this condition the inequality (4.13.36) always holds. The process $X_t$
previously constructed represents a stochastic search of the minimum which
has the advantage of scattering the search away from plateau regions.

For a better understanding we shall consider the following simple and 
explicit example. 













Example 4.13.1 We shall consider the one-dimensional case given by the
objective function $f(x) = \frac{1}{2} x^2$, $x \in (a, b)$, with $a > 0$. The
diffusion $dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t$, $X_0 = x_0 \in (a, b)$, is
one-dimensional, with $b(x)$ and $\sigma(x)$ continuous functions. Since the
Hessian is $f''(x) = 1$, the equation $\sigma^2 = \lambda\, (f'')^{-1}$ implies
$\sigma = \sqrt{\lambda}$. Also, $b(x) = -\eta f'(x) = -\eta x$. Since
$L = \inf_{a < x < b} f'(x)^2 = a^2$, then $\lambda = 2a^2\eta$. The stochastic
differential equation becomes the Langevin’s equation

$$dX_t = -\eta X_t\, dt + \sqrt{\lambda}\, dW_t, \qquad X_0 = x_0.$$

The solution is given by the following Ornstein-Uhlenbeck process

$$X_t = x_0 e^{-\eta t} + \sqrt{\lambda} \int_0^t e^{-\eta (t - s)}\, dW_s,$$

which is the sum between a deterministic function and a Wiener integral.
Consequently, the process $X_t$ is normally distributed with mean
$x_0 e^{-\eta t}$ and variance $\frac{\lambda}{2\eta} (1 - e^{-2\eta t})$.



Cartoon 2: An ant in a stochastic search for food. (Cartoon text: “An ant
searches for food in a stochastic way. If the food is located in the valley,
the ant will eventually find it.”)


Therefore, the search for the minimum is done along an Ornstein-Uhlenbeck
process. Its mean represents the expected direction of movement, while the
integral is the white noise part. The previous relation between learning rates,
$\lambda = 2a^2\eta$, can be seen as a condition stating that the noise does
not dominate the mean direction of movement.

One can picture this as an ant that would like to get off a mountain and
tries to do so by searching its way in all directions, such that, on average,
it lowers its altitude.






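The expected altitude $\varphi(t)$ at time t can actually be computed
explicitly as follows:

$$\begin{aligned}
\varphi(t) &= E[f(X_t) \,|\, X_0 = x_0] = \frac{1}{2} E[(X_t)^2 \,|\, X_0 = x_0] \\
&= \frac{1}{2} E\Big[ x_0^2 e^{-2\eta t} + 2 x_0 \sqrt{\lambda}\, e^{-\eta t} \int_0^t e^{-\eta (t - s)}\, dW_s + \lambda \Big( \int_0^t e^{-\eta (t - s)}\, dW_s \Big)^2 \Big] \\
&= \frac{1}{2} x_0^2 e^{-2\eta t} + \frac{\lambda}{2} \int_0^t e^{-2\eta (t - s)}\, ds = \frac{1}{2} \Big( x_0^2 e^{-2\eta t} + \frac{\lambda}{2\eta} (1 - e^{-2\eta t}) \Big) \\
&= \frac{1}{2} \Big[ \Big( x_0^2 - \frac{\lambda}{2\eta} \Big) e^{-2\eta t} + \frac{\lambda}{2\eta} \Big].
\end{aligned}$$

Since $\frac{\lambda}{2\eta} = a^2 < x_0^2$, it follows that $\varphi(t)$ is
decreasing with the minimum

$$\lim_{t \to \infty} \varphi(t) = \frac{1}{2} \frac{\lambda}{2\eta} = \frac{1}{2} a^2.$$

This is actually the expected result, which can also be obtained using standard
calculus techniques to minimize $f(x) = \frac{1}{2} x^2$ on $(a, b)$.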
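
The closed form of $\varphi(t)$ can be compared against a simulation of the
Langevin equation. The sketch below is only illustrative: it uses the
Euler-Maruyama discretization (a standard scheme, not one described in the
text), and the values $a = 1$, $x_0 = 2$, $\eta = 1$, the number of steps and
the number of paths are assumptions chosen for the example.

```python
import numpy as np

def simulate_ou(x0, eta, lam, t_end, n_steps, n_paths, seed=0):
    """Euler-Maruyama paths of dX = -eta * X dt + sqrt(lam) dW, X_0 = x0."""
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    x = np.full(n_paths, float(x0))
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)   # Brownian increments
        x = x - eta * x * dt + np.sqrt(lam) * dw
    return x

def expected_altitude(t, x0, eta, lam):
    """phi(t) = 0.5 * ((x0^2 - lam/(2 eta)) * exp(-2 eta t) + lam/(2 eta))."""
    c = lam / (2 * eta)
    return 0.5 * ((x0**2 - c) * np.exp(-2 * eta * t) + c)

a, x0, eta = 1.0, 2.0, 1.0
lam = 2 * a**2 * eta                        # the relation lambda = 2 a^2 eta
x_t = simulate_ou(x0, eta, lam, t_end=1.0, n_steps=500, n_paths=20000)
mc_phi = 0.5 * float(np.mean(x_t**2))       # Monte Carlo estimate of E[f(X_t)]
exact_phi = float(expected_altitude(1.0, x0, eta, lam))
```

The Monte Carlo estimate `mc_phi` should be close to `exact_phi`, up to
sampling and discretization error.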


4.14 Neighborhood Search 

The method presented in this section searches for a minimum on a neighborhood
of the current point of evaluation and then makes a step in the direction
of the smallest value. The difference from the gradient descent method
is that we do not have a priori knowledge of the direction of greatest
descent. In this method we search along a sphere of radius equal
to the learning step, $\eta$. In the case when the dimension is n = 1, 2, the
method can be implemented exactly, while for $n \ge 3$ it is computationally
convenient to consider a stochastic search.


4.14.1 Left and Right Search 

This search algorithm is applicable just for the one-dimensional case. Given
that most practical examples of cost functions in machine learning have
many more than one variable, this example is not applicable in real-life
applications. However, we include it here for completeness and for building
a discussion basis for the multidimensional case.

Consider a smooth function $f(x)$ of a real variable x. We start the search
for its minimum from the initial point $x_0$. We fix $\eta > 0$ small enough
and evaluate the function to the left, $f(x_0 - \eta)$, and to the right,
$f(x_0 + \eta)$, of $x_0$.





• If both neighbor values are larger than the initial value,
$f(x_0 \pm \eta) > f(x_0)$, then $x_0$ is already a minimum point. In the case
of the first iteration it is advisable to choose a different initial value
$x_0$ and start the procedure again.

• If one neighbor value is smaller and the other one is larger, then make a
step in the direction of the smaller value; for instance, if
$f(x_0 - \eta) < f(x_0) < f(x_0 + \eta)$, then choose $x_1 = x_0 - \eta$.

• If both neighbor values are smaller than the current value, make a step
in the direction of the smallest value (this is a greedy type algorithm). For
instance, if $f(x_0 - \eta) < f(x_0 + \eta) < f(x_0)$, then choose
$x_1 = x_0 - \eta$.

• If in any of the previous bullet points we encounter identities rather
than inequalities, readjust the size of $\eta$ such that the identities
disappear.

• Starting the search procedure from $x_1$ we construct $x_2$, and so on.
The sequence $(x_n)_n$ thus constructed will approximate the minimum of the
function $f(x)$.

By its very construction, the sequence $(x_n)_n$ is stationary, i.e., there is
a rank $N \ge 1$ such that $x_n = x_N$, for all $n \ge N$. In this case, N
represents the optimum number of training epochs.

The method can yet be improved. After the approximation sequence $(x_n)_n$
has reached its stationary value, decreasing the learning rate $\eta$ changes
the sequence into a better approximation sequence. Decreasing the learning
rate can be applied until the desired degree of accuracy for the minimum is
obtained.
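
The left and right search, together with the refinement by decreasing $\eta$,
can be sketched as follows. This is a minimal illustration under assumptions
that are not in the text: ties between neighbor values are not resolved by
readjusting $\eta$ (a tie with the current value is simply treated as
stationarity), $\eta$ is halved at each refinement, and the names
`left_right_search`, `refined_search` and the parameter `tol` are hypothetical.

```python
def left_right_search(f, x0, eta, max_iter=10_000):
    """Step by eta toward the smaller neighbor until no neighbor is smaller."""
    x = x0
    for _ in range(max_iter):
        left, right, here = f(x - eta), f(x + eta), f(x)
        if left >= here and right >= here:
            return x                      # stationary value of the sequence
        # Greedy choice: move toward the smaller of the two neighbor values.
        x = x - eta if left < right else x + eta
    return x

def refined_search(f, x0, eta, tol=1e-6):
    """Repeat the search while halving eta, to sharpen the approximation."""
    x = x0
    while eta > tol:
        x = left_right_search(f, x, eta)
        eta /= 2                          # assumption: halve the learning step
    return left_right_search(f, x, eta)

x_min = refined_search(lambda x: (x - 1.0) ** 2, x0=5.3, eta=0.5)
```

For a fixed $\eta$ the stationary point is only accurate up to the step size,
which is why the refinement loop is needed to reach a prescribed accuracy.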

4.14.2 Circular Search 

In the two-dimensional case we shall look for a minimum about $(x_0, y_0)$
using a search along the circle of radius $\eta$, whose equation is

$$(x, y) = (x_0, y_0) + \eta e^{it}, \qquad 0 \le t < 2\pi.$$

Then we move in the direction of the smallest value and repeat the procedure,
see Fig. 4.18. In order to do this, we choose n equidistant points,
$(x_0^k, y_0^k)$, along the previous circle, whose coordinates are given by

$$x_0^k = x_0 + \eta \cos\frac{2k\pi}{n}, \qquad y_0^k = y_0 + \eta \sin\frac{2k\pi}{n}, \qquad 0 \le k \le n - 1.$$

We evaluate the function f at the previous n points, $(x_0^k, y_0^k)$, and pick
the smallest value. Let

$$k_0 = \arg\min_{k} f(x_0^k, y_0^k)$$







Figure 4.18: The circle of center $(x_0, y_0)$ and radius $\eta$ is divided
into n equal sectors. At each step the point of the smallest value is chosen
and a step in that direction is made.

Figure 4.19: The search for the minimum starts at $(x_0, y_0)$ and follows the
arrows. At each step the point of the smallest value is chosen and a step
in that direction is made.


and choose a new circle of radius $\eta$ with the center at
$(x_1, y_1) = (x_0^{k_0}, y_0^{k_0})$ and continue the previous procedure. This
way we obtain the sequence $(x_m, y_m)_{m \ge 0}$ which approximates the
minimum of $f(x, y)$. The recurrence relation is defined by
$x_{m+1} = x_m^{k_m}$, $y_{m+1} = y_m^{k_m}$, with







































Figure 4.20: Two equal circles intersect at points B and D and also pass
through the centers of each other. Triangles $\triangle ABC$ and
$\triangle ACD$ are equilateral, having all sides equal to $\eta$. Therefore
the angle $\angle BCD = 120°$ and hence the arc BAD represents 1/3 of the total
circle circumference, which means that the circle centered at C has 2/3
outside of the other circle.


$$x_m^k = x_m + \eta \cos\frac{2k\pi}{n}, \qquad y_m^k = y_m + \eta \sin\frac{2k\pi}{n}, \qquad 0 \le k \le n - 1,$$

where $k_m = \arg\min_{k} f(x_m^k, y_m^k)$. This means that the points
$(x_m^k, y_m^k)$ belong to the circle of center $(x_m, y_m)$ and radius $\eta$,
and $(x_{m+1}, y_{m+1})$ is roughly the point of the largest descent along this
circle, see Fig. 4.19.

At each step it looks as if the search for the minimum is done among
n values. This number can actually be decreased. Assume that $\eta$ is small
enough so that we can assume that the minimum is not inside the previous
circle. Then the search for the minimum should be applied only to the outside
points of the next circle, see Fig. 4.20. The number of these points is
about 2n/3, since 1/3 of the points belong to the interior of the previous
circle.
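
The circular search can be sketched in Python as follows. This is only an
illustration under stated assumptions: the text does not specify a stopping
rule, so the sketch stops when no point on the circle improves the current
value; it evaluates all n points rather than only the outer 2n/3; and the
function name, the default `n=16` and the test function are hypothetical.

```python
import math

def circular_search(f, x0, y0, eta, n=16, max_iter=10_000):
    """Repeatedly move to the best of n equidistant points on a circle of radius eta."""
    x, y = x0, y0
    for _ in range(max_iter):
        # The n equidistant points (x_m^k, y_m^k), k = 0, ..., n - 1.
        candidates = [(x + eta * math.cos(2 * k * math.pi / n),
                       y + eta * math.sin(2 * k * math.pi / n)) for k in range(n)]
        best = min(candidates, key=lambda p: f(*p))
        if f(*best) >= f(x, y):
            break                 # assumption: stop when the circle gives no descent
        x, y = best
    return x, y

bowl = lambda x, y: (x - 1.0) ** 2 + (y + 2.0) ** 2   # hypothetical test function
x_end, y_end = circular_search(bowl, 4.0, 3.0, eta=0.1)
```

The final point lies within a distance of about $\eta$ from the minimizer;
a smaller $\eta$ gives a finer approximation at the cost of more steps.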

4.14.3 Stochastic Spherical Search 

In the case when the dimension is large it is computationally expensive to
construct equidistant points on a sphere of radius $\eta$. In this situation
it is more convenient to choose n uniformly distributed points on the sphere.

Assume we have a smooth function $f(x)$ with $x \in \mathbb{R}^n$, which needs
to be minimized. We start the search for the minimum from an initial point
$x_0$ by constructing n uniformly distributed points, $x_0^1, \dots, x_0^n$, on
the sphere $\mathbb{S}(x_0, \eta) = \{x;\ \|x - x_0\| = \eta\}$, where
$\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^n$. Choose
$k^* = \arg\min_{k} f(x_0^k)$, and then consider the new sphere
$\mathbb{S}(x_1, \eta)$, with $x_1 = x_0^{k^*}$. We continue the procedure by
choosing n uniformly distributed points on the new sphere, and iterate the
procedure. The sphere centers, $x_m$, form a sequence that tends toward the
minimum point of the function f.

The only issue we still need to discuss is how to choose $n$ random points on
a sphere. We shall start with the two-dimensional case, when the sphere is a
circle. We consider $n$ uniformly distributed variables $\theta_1, \dots, \theta_n \sim \mathrm{Unif}(0, 2\pi)$.
Then

$$(a + \eta \cos\theta_k, \; b + \eta \sin\theta_k), \qquad k = 1, \dots, n$$

are uniformly distributed on the circle centered at $(a, b)$ and having radius
$\eta$. The construction in the multidimensional case is similar. Consider the
uniformly distributed random variables

$$\phi_1, \dots, \phi_{n-2} \sim \mathrm{Unif}(0, \pi), \qquad \phi_{n-1} \sim \mathrm{Unif}(0, 2\pi).$$

Then define $x = (x_1, \dots, x_n)$ on the sphere $\mathbb{S}(x^0, \eta)$ in $\mathbb{R}^n$ by

$$\begin{aligned}
x_1 &= x_1^0 + \eta \cos\phi_1 \\
x_2 &= x_2^0 + \eta \sin\phi_1 \cos\phi_2 \\
x_3 &= x_3^0 + \eta \sin\phi_1 \sin\phi_2 \cos\phi_3 \\
&\;\;\vdots \\
x_{n-1} &= x_{n-1}^0 + \eta \sin\phi_1 \cdots \sin\phi_{n-2} \cos\phi_{n-1} \\
x_n &= x_n^0 + \eta \sin\phi_1 \cdots \sin\phi_{n-2} \sin\phi_{n-1}.
\end{aligned}$$

Repeating the procedure, we can create $n$ instances of the above point $x$.

As a variant of the previous algorithm, we may consider the $n$ uniformly
distributed points inside the ball $\mathbb{B}(x_0, \eta) = \{x; \|x - x_0\| \le \eta\}$ instead of the
sphere $\mathbb{S}(x_0, \eta)$.
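A minimal Python sketch of the stochastic spherical search follows. It deliberately swaps in a different sampling technique from the angle construction above: each random direction is a normalized vector of independent standard Gaussians, a standard way to obtain points that are exactly uniform on the sphere in any dimension (independent uniform angles do not give a uniform distribution on the sphere once the dimension is 3 or more). The function names, the test function and the rule of keeping the center when no sampled point improves are assumptions made for illustration.

```python
import math
import random

def random_point_on_sphere(center, eta, rng):
    """Uniform point on the sphere of radius eta around center (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(v * v for v in g))
    return [c + eta * v / norm for c, v in zip(center, g)]

def stochastic_spherical_search(f, x0, eta=0.1, n_points=20, steps=500, seed=0):
    """Repeatedly move the sphere center to the best of n_points random points on the sphere."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        candidates = [random_point_on_sphere(x, eta, rng) for _ in range(n_points)]
        best = min(candidates, key=f)
        # Assumption (not in the text): keep the center when no sampled point improves.
        if f(best) < f(x):
            x = best
    return x

def sum_of_squares(x):
    """Hypothetical test function with its minimum at the origin."""
    return sum(v * v for v in x)

x_found = stochastic_spherical_search(sum_of_squares, [1.0] * 5)
```

The variant with points inside the ball can be obtained from the same sampler by additionally scaling each direction by a random radius.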

Remark 4.14.1 Imagine you have to find an object hidden in a room.
For each step taken you are provided with one of the hints “hot” or “cold”,
according to whether you get closer to or farther from the hidden object, respectively.
This is a stochastic search similar to the one described before. The indicator
“hot” corresponds to the direction that minimizes the value of the function,
and a step should be taken in that direction. Since the location of the
hidden object is unknown, the steps are uniformly random and the one with
the hottest indicator shall be considered.






Figure 4.21: In the search for the global minimum of the surface $z = f(x)$,
several balls are left free to roll under the action of gravity. They will finally
land at the local minima of the basins of attraction they initially belonged to.


4.14.4 From Local to Global 


Most search algorithms will produce an estimation of the local minimum 
which is closest to the initial search point. However, for machine learning 
purposes, we need a global minimum. In order to produce it, one should 
start the minimum search from several locations simultaneously. This way, 
we shall obtain several local minima points, among which the lowest one 
corresponds to the global minimum. 


This method assumes the search domain is compact, so it can be divided
into a finite partition. One initial search point will be chosen in each partition
element, for instance, at its center, and a search is initiated there. If the
partition is fine enough, the number of local minima obtained is smaller
than the number of partition elements. All initial points situated in the basin of
attraction of a local minimum will estimate that specific local minimum. At
the end we should pick the one which is the smallest.

One way to imagine this is by enabling a certain number of balls to roll
downhill on the surface $z = f(x)$, see Fig. 4.21. Balls $B_1$, $B_2$, and $B_3$ belong
to the basin of attraction of the minimum $M_1$ and will roll toward $M_1$, while
balls $B_4$, $B_5$, and $B_6$ will roll toward the minimum $M_2$. While a single
ball will find only one local minimum, a relatively large number of balls,
initially distributed sparsely enough, will find all local minima of the function
$f(x)$. A final evaluation of $f$ at these local minima points yields the global
minimum.
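The multi-start idea can be sketched as follows, assuming plain gradient descent as the local search and an interval $[a, b]$ divided into equal cells. The test function $f(x) = x^4 - 3x^2 + x$, the learning rate and all names are hypothetical choices made only for illustration; the function has two local minima, near $x \approx -1.30$ (the global one) and $x \approx 1.13$.

```python
def gradient_descent(grad, x, lr=0.01, steps=2000):
    """Plain gradient descent in one variable, used here as the local search."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def multistart_minimum(f, grad, a, b, cells=8):
    """Run one local search from the center of each cell of [a, b]; keep the lowest result."""
    width = (b - a) / cells
    starts = [a + (i + 0.5) * width for i in range(cells)]
    local_minima = [gradient_descent(grad, s) for s in starts]
    return min(local_minima, key=f), local_minima

def f(x):
    return x**4 - 3 * x**2 + x

def df(x):
    return 4 * x**3 - 6 * x + 1

best, found = multistart_minimum(f, df, -2.0, 2.0)
```

Each starting point ends in the local minimum of its own basin of attraction, and the final comparison of the values of $f$ selects the lowest one.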

4.15 Continuous Learning 

This is learning with the gradient descent having an infinitesimal learning
rate. In the classical version of gradient descent with learning rate $\eta > 0$,
applied to the minimization of the cost function $f(x)$, with $x \in \mathbb{R}^n$, the
update is given by

$$x(t_{n+1}) = x(t_n) - \eta \nabla f(x(t_n)).$$

We assume that $t_n = t_0 + n\Delta t$ is a sequence of equidistant instances of
time and we consider the learning rate proportional to the time step, namely
$\eta = \lambda \Delta t$, with $\lambda > 0$. Then the previous relation can be written as

$$\frac{x(t_{n+1}) - x(t_n)}{\Delta t} = -\lambda \nabla f(x(t_n)).$$

Taking the limit $\Delta t \to 0$ and replacing $t_n$ by $t$, we obtain the differential
equation $x'(t) = -\lambda \nabla f(x(t))$. This corresponds to continuous learning, which
is learning with an infinitesimally small learning rate, $\eta = \lambda\, dt$. If the initial
point is $x_0$, the continuous learning problem is looking for a differentiable
curve $x(t)$ in $\mathbb{R}^n$ such that the following system of ODEs is satisfied

$$x'(t) = -\lambda \nabla f(x(t)) \qquad (4.15.38)$$

$$x(0) = x_0. \qquad (4.15.39)$$

In fact, $x(t)$ is an integral curve of the vector field $V = -\lambda \nabla f(x)$, namely
the vector $V(x(t))$ is the velocity vector field along the curve $x(t)$.

The curve $x(t)$ represents the curve along which the cost function has the
lowest (most negative) rate of change. If $u(t)$ is another curve, then

$$\frac{d}{dt} f(u(t)) = \langle \nabla f(u(t)), u'(t) \rangle$$

and from Cauchy’s inequality

$$-\|\nabla f(u(t))\| \, \|u'(t)\| \le \frac{d}{dt} f(u(t)) \le \|\nabla f(u(t))\| \, \|u'(t)\|.$$

The lowest rate of change is obtained when the left inequality becomes
identity

$$-\|\nabla f(u(t))\| \, \|u'(t)\| = \frac{d}{dt} f(u(t))$$

and this is achieved when $u'(t) = -\lambda \nabla f(u(t))$, with $\lambda > 0$ constant.





If $u(0) = x_0$, then by the local uniqueness property of solutions of
systems of ordinary differential equations, it follows that $u(t) = x(t)$ for small enough
values of the parameter $t$. In this case, the rate of change of the cost function
along the gradient descent curve $x(t)$ is given by

$$\frac{d}{dt} f(x(t)) = \langle \nabla f(x(t)), x'(t) \rangle = -\frac{1}{\lambda} \|x'(t)\|^2 \le 0.$$

Then the larger the magnitude of the velocity vector $x'(t)$, the faster the cost
function decreases. If the minimum $x^* = \arg\min_x f(x)$ is realized at
$x^* = \lim_{t \to T} x(t)$, with $T \in (0, \infty]$, then $\nabla f(x^*) = 0$, or equivalently, $\lim_{t \to T} \|x'(t)\| = 0$.
We shall provide an example of continuous learning following Example 4.2.7.

Example 4.15.1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be defined by $f(x) = \frac{1}{2}\|Ax - b\|^2$, where
$b \in \mathbb{R}^m$ and $A$ is an $m \times n$ matrix of rank $n$, and $\|\cdot\|$ denotes the Euclidean
norm. Since

$$f(x) = \frac{1}{2}\|Ax\|^2 - x^T A^T b + \frac{1}{2}\|b\|^2,$$

the gradient is given by $\nabla f(x) = A^T A x - A^T b$. The continuous learning
system (4.15.38)-(4.15.39) in this case becomes

$$x'(t) = -\lambda A^T A \, x(t) + \lambda A^T b$$

$$x(0) = x_0.$$

Multiplying by the integrating factor $e^{\lambda A^T A t}$, we reduce it to the following
exact equation

$$\frac{d}{dt}\Big( e^{\lambda A^T A t} x(t) \Big) = e^{\lambda A^T A t} \lambda A^T b.$$

Integrating, solving for $x(t)$, and using the initial condition, yields

$$x(t) = e^{-\lambda A^T A t} \int_0^t e^{\lambda A^T A s}\, ds \cdot \lambda A^T b + e^{-\lambda A^T A t} x_0. \qquad (4.15.40)$$

The integral can be evaluated as in the following

$$\int_0^t e^{\lambda A^T A s}\, ds = \frac{1}{\lambda} (A^T A)^{-1} \Big( e^{\lambda A^T A t} - I_n \Big).$$

Then substituting in (4.15.40) we obtain

$$x(t) = (A^T A)^{-1} A^T b - e^{-\lambda A^T A t} \Big( (A^T A)^{-1} A^T b - x_0 \Big).$$

Denoting the Moore-Penrose pseudoinverse solution by $x^* = (A^T A)^{-1} A^T b$, see section
G.2 of the Appendix, we can write the gradient descent curve as

$$x(t) = x^* - e^{-\lambda A^T A t} (x^* - x_0).$$

Since $\lim_{t \to \infty} e^{-\lambda A^T A t} = 0_n$, see Proposition G.2.1 of the Appendix,
the curve $x(t)$ approaches the Moore-Penrose pseudoinverse solution in the long run,

$$\lim_{t \to \infty} x(t) = x^*,$$

which is the minimum of the function $f(x) = \frac{1}{2}\|Ax - b\|^2$.
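The closed-form curve $x(t) = x^* - e^{-\lambda A^T A t}(x^* - x_0)$ can be checked numerically against ordinary gradient descent with the small learning rate $\eta = \lambda \Delta t$. The sketch below is restricted to $n = 2$ so that the matrix exponential can be written by hand through Sylvester's formula for a symmetric $2 \times 2$ matrix with distinct eigenvalues; the matrix $A$, the vector $b$ and all function names are hypothetical choices made for illustration.

```python
import math

A = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]   # hypothetical 3 x 2 matrix of rank 2
b = [1.0, 2.0, 3.0]

def normal_equations(A, b):
    """Return M = A^T A (2 x 2) and c = A^T b."""
    M = [[sum(row[i] * row[j] for row in A) for j in range(2)] for i in range(2)]
    c = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(2)]
    return M, c

def expm_sym2(M, s):
    """exp(s*M) for a symmetric 2 x 2 matrix with distinct eigenvalues (Sylvester's formula)."""
    mean = 0.5 * (M[0][0] + M[1][1])
    rad = math.hypot(0.5 * (M[0][0] - M[1][1]), M[0][1])
    mu1, mu2 = mean + rad, mean - rad
    e1, e2 = math.exp(s * mu1), math.exp(s * mu2)
    eye = [[1.0, 0.0], [0.0, 1.0]]
    return [[(e1 * (M[i][j] - mu2 * eye[i][j]) - e2 * (M[i][j] - mu1 * eye[i][j])) / (mu1 - mu2)
             for j in range(2)] for i in range(2)]

def x_star(M, c):
    """Solve M x = c for a 2 x 2 matrix M, i.e. x* = (A^T A)^{-1} A^T b."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * c[0] - M[0][1] * c[1]) / det,
            (M[0][0] * c[1] - M[1][0] * c[0]) / det]

def closed_form(t, x0, lam=1.0):
    """x(t) = x* - exp(-lam * A^T A * t) (x* - x0)."""
    M, c = normal_equations(A, b)
    xs = x_star(M, c)
    E = expm_sym2(M, -lam * t)
    d = [xs[0] - x0[0], xs[1] - x0[1]]
    return [xs[i] - (E[i][0] * d[0] + E[i][1] * d[1]) for i in range(2)]

def euler_flow(t, x0, lam=1.0, dt=1e-3):
    """Gradient descent with learning rate eta = lam * dt, run for t / dt steps."""
    M, c = normal_equations(A, b)
    x = list(x0)
    for _ in range(round(t / dt)):
        g = [M[i][0] * x[0] + M[i][1] * x[1] - c[i] for i in range(2)]
        x = [x[i] - lam * dt * g[i] for i in range(2)]
    return x
```

For this choice of $A$ and $b$ one finds $A^TA = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}$, $A^Tb = (3, 8)^T$ and $x^* = (7/9, 13/9)^T$, so both `closed_form(t, x0)` for large `t` and `euler_flow` should approach this point.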


4.16 Summary 

Finding the global minimum of a given cost function is an important ingredient
of any machine learning algorithm. The learning process is associated with
the procedure of tuning the cost function variables into their optimal values.
Some methods are of first order, i.e., involve just the first derivative (such as
the gradient descent, line search, momentum method, etc.), which makes
them run relatively fast. Others are second-order methods, involving the second
derivative, or curvature, of the cost function (such as the Hessian method,
Newton’s method, etc.). Other methods have a topological or stochastic flavor.

The most well-known method is gradient descent. The minimum is
searched in the direction of the steepest descent of the function, which is
indicated by the negative direction of the gradient. The method is easy to
implement but exhibits difficulties for nonconvex functions with multiple local
minima or functions with plateaus. The method has been improved in a number
of ways. To avoid getting stuck in a local minimum or overshooting
the minimum, the momentum method (and its variants) has been developed.
The line search algorithm looks for a minimum along a line normal to the
level surfaces and implements an adjustable rate.

Several adaptive optimization techniques derived from the gradient descent
method, like AdaGrad, Adam, AdaMax, and RMSProp, are presented. They
share the common feature of involving moment approximations computed
by exponential moving averages.

The Hessian method and Newton’s method involve the computation of
the second-order approximation of the cost function. Even if theoretically
they provide good solutions, they are computationally expensive.

Simulated annealing is a probabilistic method for computing minima
of objective functions which are interpreted as the internal energy of a
thermodynamical system. Increasing the heat parameter of the system and then
cooling it down slowly enables the system to reach its lowest energy level.

Stochastic search proposes an algorithm for driving the search out of
plateaus. This involves looking for a minimum along a stochastic process
rather than a deterministic path. This method is also computationally expensive,
since it involves the computation of the inverse of the Hessian of the
cost function.

Neighborhood search is a minimum search of a topological nature. The 
minimum is searched uniformly along spheres centered at the current search 
center. 

All the above search methods produce local minima. In order to find a 
global minimum, certain methods have to be applied. One of them is applica- 
ble in the case of compact domains. In this case the domain is divided into a 
finite partition and a minimum is searched on each partition set. The lowest 
minimum approximates the global minimum. 

4.17 Exercises 

Exercise 4.17.1 Let $f(x_1, x_2) = e^{x_1} \sin x_2$, with $(x_1, x_2) \in (0, 1) \times (0, \frac{\pi}{2})$.

(a) Show that $f$ is a harmonic function;

(b) Find $\|\nabla f\|$;

(c) Show that the equation $\nabla f = 0$ does not have any solutions;

(d) Find the maxima and minima for the function $f$.

Exercise 4.17.2 Consider the quadratic function $Q(x) = \frac{1}{2} x^T A x - bx$, with
$A$ a nonsingular square matrix of order $n$.

(a) Find the gradient $\nabla Q$;

(b) Write the gradient descent iteration;

(c) Find the Hessian $H_Q$;

(d) Write the iteration given by Newton’s formula and compute its limit.

Exercise 4.17.3 Let $A$ be a nonsingular square matrix of order $n$ and $b \in \mathbb{R}^n$
a given vector. Consider the linear system $Ax = b$. The solution of this system
can be approximated using the following steps:

(a) Associate the cost function $C(x) = \frac{1}{2}\|Ax - b\|^2$. Find its gradient, $\nabla C(x)$,
and Hessian, $H_C(x)$;

(b) Write the gradient descent algorithm iteration which converges to the
system solution $x$ with the initial value $x^0 = 0$;

(c) Write the Newton’s iteration which converges to the system solution $x$
with the initial value $x^0 = 0$.

Exercise 4.17.4 (a) Let $(a_n)_n$ be a sequence with $a_0 > 0$ satisfying the
inequality

$$a_{n+1} \le \mu a_n + K, \qquad \forall n \ge 1,$$

with $0 < \mu < 1$ and $K > 0$. Show that the sequence $(a_n)_n$ is bounded from
above.

(b) Consider the momentum method equations (4.4.16)-(4.4.17), and assume
that the function $f$ has a bounded gradient, $\|\nabla f(x)\| \le M$. Show that the
sequence of velocities, $(v_n)_n$, is bounded.

Exercise 4.17.5 (a) Let $f$ and $g$ be two integrable functions. Verify that

$$\int (f * g)(x)\, dx = \int f(x)\, dx \int g(x)\, dx;$$

(b) Show that $\|f * g\|_1 \le \|f\|_1 \|g\|_1$;

(c) Let $f_\sigma = f * G_\sigma$, where $G_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$. Prove that $\|f_\sigma\|_1 \le \|f\|_1$ for
any $\sigma > 0$.


Exercise 4.17.6 Show that the convolution of two Gaussians is also a Gaussian:

$$G_{\sigma_1} * G_{\sigma_2} = G_\sigma,$$

with $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$.


Exercise 4.17.7 Show that if $n$ numbers have the sum equal to $s$,

$$\sigma_1 + \cdots + \sigma_n = s,$$

then the minimum of the sum of their squares, $\sum_{j=1}^n \sigma_j^2$,
occurs for the case when all the numbers are equal to $s/n$.









Chapter 5 

Abstract Neurons 


The abstract neuron is the building block of any neural network. It is a unit 
that mimics a biological neuron, consisting of an input (incoming signal), 
weights (synaptic weights) and activation function (neuron firing model). 
This chapter introduces the most familiar types of neurons (perceptron, sig- 
moid neuron, etc.) and investigates their properties. 

5.1 Definition and Properties

In the light of examples presented in Chapter 1, it makes sense to consider
the following definition that formalizes our previously developed intuition:

Definition 5.1.1 An abstract neuron is a quadruple $(x, w, \varphi, y)$, where $x^T =
(x_0, x_1, \dots, x_n)$ is the input vector, $w^T = (w_0, w_1, \dots, w_n)$ is the weights
vector, with $x_0 = -1$ and $w_0 = b$, the bias, and $\varphi$ is an activation function
that defines the outcome function $y = \varphi(x^T w) = \varphi\big(\sum_{i=0}^n w_i x_i\big)$.

The way the abstract neuron learns a desired target variable $z$ is by tuning
the weights vector $w$ such that a certain function measuring the error between
the desired variable $z$ and the outcome $y$ is minimized. Several possible error
functions have been covered in Chapter 3 and their minimization algorithms
were treated in Chapter 4.

The abstract neuron is represented in Fig. 5.1. The computing unit is
divided into two parts: the first one contains the summation symbol, $\Sigma$, which
indicates a summation of the inputs with the given weights, and the second
contains the generic activation function $\varphi$ used to define the output $y$.


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_5 





Figure 5.1: Abstract neuron with bias $b$ and activation function $\varphi$.

It is worth noting that the threshold $b$ is included in the weights system
as a bias, $w_0 = b$, with constant input $x_0 = -1$. Equivalently, we may also
consider the bias $w_0 = -b$ and the corresponding input $x_0 = 1$. In either
case, the expression of the inner product is given by

$$x^T w = w^T x = w_1 x_1 + \cdots + w_n x_n - b.$$

The vectors $w^T = (w_0, \dots, w_n)$ and $x^T = (x_0, \dots, x_n)$ are the extended
weights and extended input vectors. The output function $y = f(x) = \varphi(w^T x)$
is sometimes called the primitive function of the given computing unit.

The inputs $\{x_i\}_{i=1,\dots,n}$ are communicated through edges to the computing
unit, after they are multiplied by a weight, see Fig. 5.1. The input
data can be of several numerical types, such as:

• binary if $x_i \in \{0, 1\}$ for $i = 1, \dots, n$;

• signed if $x_i \in \{-1, 1\}$ for $i = 1, \dots, n$;

• digital if $x_i \in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ for $i = 1, \dots, n$;

• arbitrary real numbers if $x_i \in (-\infty, \infty)$ for $i = 1, \dots, n$;

• an interval of real numbers if $x_i \in [0, 1]$ for $i = 1, \dots, n$.

For instance, any handwritten digit can be transformed into binary data,
see Fig. 5.2. The 4 x 4 matrix is read line by line and placed into a sequence
of 0s and 1s with length 16. A value of 1 is assigned if the pixel is activated
more than 50%. Otherwise, the value 0 is assigned. Coding characters this way
is quite simplistic, some information regarding local shapes being lost in the
process. However, there are better ways of encoding figures using convolution.
In general, in the case of grayscale images each pixel has an activation which is
a number between 0 and 1 (1 for full black activation and 0 for no activation,
or white color).
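The encoding rule just described can be written as a short Python sketch. The function name `encode_binary` and the grayscale values in `glyph` are hypothetical; the values were chosen so that thresholding at 50% reproduces the sequence 0110001001000100 shown in Fig. 5.2.

```python
def encode_binary(pixels, threshold=0.5):
    """Read a grayscale matrix row by row; emit 1 where the activation exceeds the threshold."""
    return "".join("1" if p > threshold else "0" for row in pixels for p in row)

# Hypothetical 4 x 4 grayscale activations (0 = white, 1 = full black).
glyph = [
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.2, 0.7, 0.1],
    [0.0, 0.6, 0.3, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]

code = encode_binary(glyph)
```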

Input efficiency Assume that one would like to implement a neuron circuit.
One question which can be asked is which type of input is more efficient?
Equivalently, how many switching states optimize the transmitted information
from the circuit implementation cost point of view?








0110001001000100

Figure 5.2: Transformation of a 4 x 4 pixels character into a sequence of 0s
and 1s.


To answer this, let $\beta$ denote the number of input signal states, for instance,
$\beta = 2$ for the binary signals. The number of channels, or inputs, is provided
by the number $n$. The implementation cost, $F$, is assumed proportional to
both $n$ and $\beta$, i.e., $F = cn\beta$, where $c > 0$ is a proportionality constant. Using
$n$ channels with $\beta$ states, one can represent $\beta^n$ numbers.¹ Now we assume
the cost $F$ is fixed and try to optimize the number of transmitted numbers as
a function of $\beta$. Consider the constant $k = F/c$. Then $n = k/\beta$. Therefore
$f(\beta) = \beta^{k/\beta}$ numbers can be represented, for a given $\beta > 0$. This function
reaches its maximum at the same point as its logarithm, $g(\beta) = \ln f(\beta) =
\frac{k}{\beta} \ln \beta$. Since its derivative is $g'(\beta) = \frac{k}{\beta^2}(1 - \ln \beta)$, it follows that the maximum
is achieved for $\beta = e \approx 2.718$. Taking the closest integer value, it follows that
$\beta = 3$ is the optimal number of states for the input signals. However, this
book deals with the theory of neuron models with any number of input
states.
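The comparison between integer numbers of states can be checked directly. The sketch below evaluates $g(\beta) = \frac{k}{\beta}\ln\beta$ for a few integer values of $\beta$; the function names and the candidate range are assumptions made for illustration, and the value of $k$ does not affect which $\beta$ is best.

```python
import math

def log_capacity(beta, k=1.0):
    """g(beta) = (k / beta) * ln(beta), the logarithm of f(beta) = beta**(k / beta)."""
    return (k / beta) * math.log(beta)

def best_integer_states(candidates=range(2, 11), k=1.0):
    """Integer number of states that maximizes g among the candidates."""
    return max(candidates, key=lambda beta: log_capacity(beta, k))

best = best_integer_states()
```

Note that $g(2) = g(4) = \frac{k}{2}\ln 2$, so binary and quaternary inputs are equally efficient under this cost model, and both are slightly below $\beta = 3$.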

In the following we shall present a few classical types of neurons specifying 
their input types and activation functions. 


5.2 Perceptron Model 

A perceptron is a neuron with the input either zero or one, i.e., $x_i \in \{0, 1\}$,
and with the activation function given by the Heaviside function

$$\varphi(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x \ge 0. \end{cases}$$

¹ This is the number of sequences of length $n$ with elements chosen from the given $\beta$
states. By a sequence we understand a function $h : \{1, \dots, n\} \to \{1, \dots, \beta\}$.


































The output of the perceptron is given as a threshold gate

$$y = \varphi(x^T w - b) = \begin{cases} 0, & \text{if } \sum_{i=1}^n w_i x_i < b \\ 1, & \text{if } \sum_{i=1}^n w_i x_i \ge b. \end{cases}$$

This way, a perceptron is a rule that makes decisions by weighing up evidence
supplied by the inputs $x_i$. The threshold $b$ is a measure of how easy it is for the
perceptron to output 1. For all general purposes, this is
regarded as another weight, denoted by $w_0$, called bias.

We shall deal next with the geometric interpretation of a perceptron.
Consider now an $(n-1)$-dimensional hyperplane in $\mathbb{R}^n$ defined by

$$\mathcal{H} = \Big\{ (x_1, \dots, x_n); \; \sum_{i=1}^n w_i x_i = b \Big\}.$$

Its normal vector $N$ is given in terms of weights as

$$N^T = (w_1, \dots, w_n),$$

where $T$ denotes the transpose vector, as usual. The hyperplane passes through
a point $p$, which is related to the bias by the relation $b = p^T w$. Then the outcome
$y$ of a perceptron associates the value 0 to one of the open half-spaces determined
by the hyperplane $\mathcal{H}$, and 1 to the rest (the other half-space and $\mathcal{H}$). For
the case of two inputs, $n = 2$, see Fig. 5.3. Roughly speaking, given a linear
boundary between two countries, a perceptron can decide whether a selected
point belongs to one of the countries, or not.

The perceptron can implement the logical gates AND (“$\wedge$”) and OR
(“$\vee$”) as in the following. This property makes the perceptron important for
logical computation.

Implementing AND. Consider the Boolean operation defined by the next
table:

$x_1$   $x_2$   $y = x_1 \wedge x_2$
0       0       0
0       1       0
1       0       0
1       1       1

The same output function, $y = x_1 \wedge x_2$, can be generated by a perceptron
with two inputs $x_1, x_2 \in \{0, 1\}$, weights $w_1 = w_2 = 1$, bias $b = 1.5$, and
$x_0 = -1$, see Fig. 5.4 a. We have

$$y = \varphi(x_1 + x_2 - 1.5) = \begin{cases} 0, & \text{if } x_1 + x_2 < 1.5 \\ 1, & \text{if } x_1 + x_2 \ge 1.5. \end{cases}$$





Figure 5.3: a. The separation of the plane into half-planes. b. The graph of
the activation function for a perceptron with two inputs, $x_1$ and $x_2$.

The condition $\{x_1 + x_2 \ge 1.5\}$ is satisfied only for $x_1 = x_2 = 1$, while the
second condition $\{x_1 + x_2 < 1.5\}$ holds for the rest of the combinations. It
follows that this specific perceptron has the same output function as the one
given by the previous table.

In the light of the previous geometric interpretation of the perceptron, the line
$x_1 + x_2 = 1.5$ splits the input data $\{(0, 0), (0, 1), (1, 0), (1, 1)\}$ into two classes,
by associating the value 1 to the half-plane $\{x_1 + x_2 \ge 1.5\}$, and the value 0
to the open half-plane $\{x_1 + x_2 < 1.5\}$, see Fig. 5.5 a.

Implementing OR. This Boolean operation is defined by the table:

$x_1$   $x_2$   $y = x_1 \vee x_2$
0       0       0
0       1       1
1       0       1
1       1       1

The output function, $y = x_1 \vee x_2$, is generated by a perceptron with two
inputs $x_1, x_2 \in \{0, 1\}$, weights $w_1 = w_2 = 1$, bias $b = 0.5$, and $x_0 = -1$, see
Fig. 5.4 b. The output is written as

$$y = \varphi(x_1 + x_2 - 0.5) = \begin{cases} 0, & \text{if } x_1 + x_2 < 0.5 \\ 1, & \text{if } x_1 + x_2 \ge 0.5. \end{cases}$$

The conclusion follows from the observation that the condition $\{x_1 + x_2 < 0.5\}$
is satisfied only for $x_1 = x_2 = 0$.

Figure 5.5 b represents a split of the input data $\{(0, 0), (0, 1), (1, 0), (1, 1)\}$
into two classes: the first is the data in the shaded half-plane $\{x_1 + x_2 \ge 0.5\}$
and the other is the data in the open half-plane $\{x_1 + x_2 < 0.5\}$. The value
1 is associated to the first class and the value 0 to the second class.
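Both gates can be checked with a few lines of Python. The sketch below implements the threshold gate $y = \varphi(w_1 x_1 + w_2 x_2 - b)$ with the Heaviside activation, using the weights $w_1 = w_2 = 1$ and the thresholds 1.5 (AND) and 0.5 (OR) from the formulas above; the helper names are assumptions made for illustration.

```python
def heaviside(u):
    """Heaviside activation: 0 for u < 0 and 1 for u >= 0."""
    return 1 if u >= 0 else 0

def perceptron(weights, b):
    """Return the threshold gate y(x) = heaviside(w . x - b)."""
    def y(*x):
        return heaviside(sum(w * xi for w, xi in zip(weights, x)) - b)
    return y

AND = perceptron((1, 1), 1.5)
OR = perceptron((1, 1), 0.5)
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

and_outputs = [AND(*p) for p in inputs]
or_outputs = [OR(*p) for p in inputs]
```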
























Figure 5.4: Boolean function implementation using a perceptron: a. The output
function AND. b. The output function OR.

Figure 5.5: Partition using a perceptron; the shaded half-plane is associated
with the value 1. a. The output function AND. b. The output function OR.


However, a perceptron cannot implement all logical functions. For instance,
it cannot implement the XOR (exclusive OR) function, which is an operation
defined by the table:

$x_1$   $x_2$   $y$
0       0       0
0       1       1
1       0       1
1       1       0

The impossibility for a perceptron to implement the XOR function follows from
the fact that there is no line that separates the points labeled 0 from the points
labeled 1 in Fig. 5.6 a. This can be shown in a couple of ways.

(a) One proof follows as a consequence of the separation properties of
the plane. If one assumes, by contradiction, the existence of a separation
line, then the plane is divided by the line in two half-planes, $H_1$ and $H_2$,
which are convex sets², see Fig. 5.6 b. Since both 0-symbols belong to $H_1$,
then by convexity the line segment joining them is entirely included into $H_1$.
Similarly, the line segment joining both 1-symbols is contained in $H_2$. Since
$H_1$ and $H_2$ do not have any points in common (as they form a partition of the
plane), the line segments joining both 0-symbols and both 1-symbols are
disjoint. However, it can be seen from Fig. 5.6 a that these two line segments
do intersect, as diagonals of a square. This leads to a contradiction, which
shows that the assumption made is false, and hence there is no separation
line.

(b) The second proof variant is based on algebraic reasons. Assume there
is a separation line of the form $w_1 x_1 + w_2 x_2 = b$, which separates the points
$\{(0, 0), (1, 1)\}$ from the points $\{(0, 1), (1, 0)\}$. Considering a choice, we have
from testing the first pair of points that $0 < b$ and $w_1 + w_2 < b$ and for the
second pair that $w_1 > b$, $w_2 > b$. The sum of the last two inequalities together
with the second inequality implies the contradictory inequality $2b < b$, since
$b$ is positive.

The fact that one perceptron cannot learn the XOR function was pointed 
out in [87]. We shall see later in Remark 6.1.2 that a neural network of two 
perceptrons can successfully learn this Boolean function. 
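The impossibility result can be illustrated, though of course not proved, by an exhaustive search over a finite grid of weights and thresholds. In the sketch below the grid, the helper names and the tie convention (output 1 when $w_1 x_1 + w_2 x_2 - b \ge 0$) are assumptions; the search finds parameters for AND but none for XOR.

```python
from itertools import product

def realizes(table, w1, w2, b):
    """True if the perceptron heaviside(w1*x1 + w2*x2 - b) reproduces the truth table."""
    return all((1 if w1 * x1 + w2 * x2 - b >= 0 else 0) == y
               for (x1, x2), y in table.items())

def search(table, grid):
    """Return the first (w1, w2, b) on the grid that realizes the table, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if realizes(table, w1, w2, b):
            return (w1, w2, b)
    return None

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
grid = [i / 4 for i in range(-12, 13)]   # -3.0, -2.75, ..., 3.0

xor_solution = search(XOR, grid)   # no grid point works
and_solution = search(AND, grid)   # some grid point works
```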

The next paragraph will present the perceptron as a linear classifier in 
more detail. 

² A set is convex if for any two points that belong to the set the line segment defined by
the points is included in the set.







Figure 5.6: Cluster classification using one perceptron: a. The output function
XOR, with the points (0, 0) and (1, 1) labeled 0 and the points (0, 1) and (1, 0)
labeled 1. b. A line divides the plane in two convex sets $H_1$ and $H_2$.


Clusters splitting. Motivated by the previous application of splitting the
input data into classes, we extend it to a more general problem of two-cluster
classification. We allow now for the input data to be numbers taking values in
the interval [0, 1] and assume for the sake of simplicity that $n = 2$. Then each
data point is a pair of real numbers that can be represented as a point in the unit
square. Assume now that we associate a double-valued label or characteristic
with each point, given by “group 1” (or, red color, star shaped, etc.) and
“group 2” (or blue color, disk shaped, etc.). The question that arises is:
Can a perceptron decide whether a point belongs to one group or another?
The answer depends on the data distribution. If data can be separated by a
line such that one group is contained in one half-plane and the other in the
other half-plane (i.e., the groups are separated by a line), then the perceptron
might eventually be able to decide which group is associated with what label
after a proper tuning of the weights. In order to do this, we introduce the
function

$$z(x_1, x_2) = \begin{cases} 0, & \text{if } (x_1, x_2) \in \mathcal{G}_1 \\ 1, & \text{if } (x_1, x_2) \in \mathcal{G}_2, \end{cases}$$

where notations $\mathcal{G}_1$ and $\mathcal{G}_2$ stand for “group 1” and “group 2”, respectively.
The perceptron decides between the two groups if we are able to come up
with two weights $w_1$, $w_2$ and a bias $b$ such that the line $w_1 x_1 + w_2 x_2 - b = 0$
separates the groups $\mathcal{G}_1$ and $\mathcal{G}_2$. If this is possible, then the output function

$$y = \varphi(w_1 x_1 + w_2 x_2 - b)$$

has the same expression as the aforementioned function $z$. More details about
the learning algorithms of a perceptron will be provided in a later chapter.





Figure 5.7: Perceptron classifying groups of males and females (vertical axis:
height in inches). The line $x_2 = -5x_1 + 115$ is a decision boundary in the
feature space $\{x_1, x_2\}$ between two linearly separable clusters.


Example 5.2.1 Consider the input $x = (x_1, x_2)$, where $x_1$ denotes the shoe
size and $x_2$ the height (in inches) of some individuals of a given population.
We associate a label to each individual by considering the mapping $(x_1, x_2) \to
\ell \in \{\text{man}, \text{woman}\}$. Given a feature pair $(x_1, x_2)$, a perceptron is able to
distinguish whether this corresponds to a man or to a woman if there is a
separation line, $w_1 x_1 + w_2 x_2 = b$, between the two gender groups.

Data collected from a population of Michigan freshmen has the scatter
plot shown in Fig. 5.7. We notice that the female group is linearly separable
from the male group and a separation line is given by $x_2 = -5x_1 + 115$.
Hence, given an individual with shoe size $s$ and height $h$, the perceptron
would classify it as “male” if $h > -5s + 115$ and as “female” otherwise.
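The decision rule can be written as a perceptron with weights $w_1 = 5$, $w_2 = 1$ and threshold $b = 115$, since $h > -5s + 115$ is the same as $5s + h - 115 > 0$. The sketch below is only an illustration; the function name and the two sample individuals are hypothetical and are not taken from the Michigan data.

```python
def classify(shoe_size, height_in):
    """Perceptron with weights (5, 1) and threshold 115: sign of 5*x1 + x2 - 115."""
    u = 5 * shoe_size + height_in - 115
    return "male" if u > 0 else "female"

# Hypothetical individuals (shoe size, height in inches).
first = classify(11, 70)    # 5*11 + 70 - 115 = 10 > 0
second = classify(7, 64)    # 5*7 + 64 - 115 = -16 <= 0
```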

This problem can be formulated in terms of neurons in a couple of ways,
as follows.

(i) First, we shall associate numerical labels to data points. We shall label
the points corresponding to males by $z = 1$ and the ones corresponding to
females by $z = -1$. Consider a neuron with input $x = (x_1, x_2)$, weights
$w_1$, $w_2$, and bias $b$, having the step activation function

$$S(u) = \begin{cases} 1, & \text{if } u \ge 0 \\ -1, & \text{if } u < 0. \end{cases}$$

The neuron’s output is the function $y = S(w_1 x_1 + w_2 x_2 - b)$, see Fig. 5.8 a. If
$(x_1, x_2)$ corresponds to a man, then $w_1 x_1 + w_2 x_2 - b > 0$, and hence the output
is $y = 1$. If $(x_1, x_2)$ corresponds to a female, then $w_1 x_1 + w_2 x_2 - b < 0$, and
then the output is $y = -1$. The parameters of the separator line are obtained
by minimizing the distance between the outcome and label vectors as

$$(w_1^*, w_2^*, b^*) = \arg\min_{w_i, b} \|y - z\|^2,$$

with $\|y - z\|^2 = \sum_{x \in \text{data}} (y(x) - z(x))^2$, where $z(x)$ and $y(x)$ are, respectively,
the label and the output of the data point $x$. This distance can be minimized
using, for instance, the gradient descent method.


(ii) In this case we shall label data using one-hot vectors. We associate the
vector (1, 0) to males and (0, 1) to females. We consider a neural network
with input $x = (x_1, x_2)$, and two output neurons with a softmax activation
function

$$(y_1, y_2) = \mathrm{softmax}(u_1, u_2) = \Big( \frac{e^{u_1}}{e^{u_1} + e^{u_2}}, \; \frac{e^{u_2}}{e^{u_1} + e^{u_2}} \Big),$$

see Fig. 5.8 b. Here, $u_i$ are the signals collected from the inputs as

$$u_1 = w_{11} x_1 + w_{21} x_2 - b_1$$

$$u_2 = w_{12} x_1 + w_{22} x_2 - b_2.$$

The cost function to be minimized in this case is

$$\sum_{x \in \text{data}} \Big( \|y_1(x) - z_1(x)\|^2 + \|y_2(x) - z_2(x)\|^2 \Big).$$


5.3 The Sigmoid Neuron 

We have seen that a perceptron works as a device that provides two decision
states, 0 and 1, with no intermediate values. The situation can be improved
if the neuron could take all the states in the interval (0, 1); this way, outputs
closer to 1 correspond to decisions that are more likely to occur, while the
outputs closer to 0 represent decisions with small chances of occurrence. In
this case, the neuron output acts as the likelihood of taking a certain weighted
decision.


















Figure 5.8: a. The neuron in the case when the labels are $z \in \{-1, 1\}$. b. The
neural network in the case of one-hot-vector labels.


A sigmoid neuron is a computing unit with the input $x = (x_1, \dots, x_n)$
and with the activation function given by the logistic function

$$\sigma(z) = \frac{1}{1 + e^{-z}}. \qquad (5.3.1)$$

The vector $w^T = (w_1, \dots, w_n)$ denotes the weights and $b$ is the neuron bias.
The output of a sigmoid neuron is a number in (0, 1), which is given by

$$y = \sigma(w^T x - b) = \frac{1}{1 + e^{-w^T x + b}}.$$

If the activation function is considered to be the scaled logistic function

$$\sigma_c(z) = \frac{1}{1 + e^{-cz}}, \qquad c > 0,$$

then the output of the sigmoid neuron tends to the output of the perceptron
for large values of $c$.

The advantage of this type of neuron is twofold: (i) it approximates perceptrons;
(ii) the output function is continuous with respect to weights, i.e.,
small changes in the weights and threshold correspond to small changes in the
output; this does not hold in the case of the perceptron, whose output has a
jump discontinuity. This continuity property will be of utmost importance
in the learning algorithms in later chapters.
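A short Python sketch makes the approximation property concrete: for a fixed input with $w^T x - b \neq 0$, the output of the neuron with the scaled logistic activation $\sigma_c$ approaches the perceptron output as $c$ grows. The function names and the sample weights are assumptions made for illustration.

```python
import math

def logistic(z, c=1.0):
    """Scaled logistic function 1 / (1 + exp(-c z)); c = 1 gives the standard logistic."""
    return 1.0 / (1.0 + math.exp(-c * z))

def sigmoid_neuron(x, w, b, c=1.0):
    """Output sigma_c(w . x - b) of a sigmoid neuron."""
    return logistic(sum(wi * xi for wi, xi in zip(w, x)) - b, c)

def perceptron(x, w, b):
    """Heaviside output of the perceptron with the same weights and threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b >= 0 else 0

w, b = (1.0, 1.0), 1.5
soft = sigmoid_neuron((1, 1), w, b)          # c = 1: an intermediate value in (0, 1)
sharp = sigmoid_neuron((1, 1), w, b, c=50)   # c = 50: very close to the perceptron output 1
```

At the boundary $w^T x - b = 0$ the scaled logistic equals 1/2 for every $c$, so the convergence to the perceptron output holds only away from the separating hyperplane.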

It is worth noting the way the weights relate to the input X{ and the 
output y. For this we shall use the formula for the inverse of the logistic 

function, <r -1 (x) = ln-. Then the output equation y — <r(w T x — b ) 

1 — x 

becomes ln ——— = w T x — b. Now, we evaluate the expression at two distinet 

i -y 



144 


Deep Learning Architectures, A Mathematical Approach 


values of inputs, x* — :r r and x* — x t + 1 : 

y 


ln 


i - y 


ry* . - ry» 


= w\Xi -j-b WiX*-\ -b w n x n -b 


ln 


y 


1 - 2 / 


Xi = x\ + 1^) = W\X\ H-b Wi(x * + 1) H-b ~ b. 


Subtracting, and using the properties of logarithmic functions, yields

\[ w_i = \ln \frac{\dfrac{y}{1-y}\Big|_{x_i = x_i^* + 1}}{\dfrac{y}{1-y}\Big|_{x_i = x_i^*}}, \tag{5.3.2} \]

which is an expression independent of the value $x_i^*$. As long as the $i$th input
increases by 1 unit, formula (5.3.2) provides an expression for the $i$th weight
of the neuron, in terms of the output $y$. In statistics, the expression
$\frac{y}{1-y}$ is used under the name of odds in favor. This way, the weight $w_i$ can
be expressed in terms of the quotient of two odds in favor.
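Formula (5.3.2) can be verified on a small example. The sketch below uses hypothetical weights, bias and input (none of them come from the text) and recovers one weight from the quotient of two odds.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function (5.3.1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(w, x, b):
    """Output of the sigmoid neuron, y = sigma(w^T x - b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - b)

def odds(y: float) -> float:
    """Odds in favor, y / (1 - y)."""
    return y / (1.0 - y)

# Hypothetical weights, bias and input, chosen only for illustration.
w = [0.8, -1.3, 0.4]
b = 0.25
x = [0.5, 2.0, -1.0]

i = 1                      # index of the input that is increased by one unit
x_plus = list(x)
x_plus[i] += 1.0

# Formula (5.3.2): w_i is the logarithm of the quotient of the two odds.
w_i_recovered = math.log(odds(neuron_output(w, x_plus, b)) / odds(neuron_output(w, x, b)))
print(w_i_recovered, w[i])
```

The recovered value agrees with `w[i]` up to rounding, regardless of the starting input `x`.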


Different types of sigmoid neurons can be obtained considering activation
functions given by other sigmoidal functions, such as: hyperbolic tangent,
arctangent function, softsign, etc. The only difference between them is the
learning speed due to different rates of saturation. However, from the
mathematical point of view they will be treated in a similar way. The infinitesimal
change in the output of a sigmoid neuron, $y = \varphi(w^T x - b)$, with
activation function $\varphi$, in terms of the changes in weights and bias is given by the
differential

\[
\begin{aligned}
dy &= \sum_{i=1}^n \frac{\partial y}{\partial w_i}\, dw_i + \frac{\partial y}{\partial b}\, db \\
   &= \sum_{i=1}^n \varphi'(w^T x - b)\, x_i\, dw_i + \varphi'(w^T x - b)(-1)\, db \\
   &= \varphi'(w^T x - b)\Big( \sum_{i=1}^n x_i\, dw_i - db \Big).
\end{aligned}
\]


We note that the change $dy$ is proportional to the derivative of $\varphi$. The rate
$\varphi'$ depends on the sigmoidal function chosen, and in some cases can be easily
computed. For instance, if $\varphi(z) = \sigma(z)$ is the logistic function, then

\[ dy = \sigma(w^T x - b)\big(1 - \sigma(w^T x - b)\big)\Big( \sum_{i=1}^n x_i\, dw_i - db \Big). \]

In the case when $\varphi(z) = t(z)$ is the hyperbolic tangent function, we have

\[ dy = \big(1 - t^2(w^T x - b)\big)\Big( \sum_{i=1}^n x_i\, dw_i - db \Big). \]















The previous differentials are useful when computing the gradient of the output

\[ \nabla y = (\nabla_w y, \partial_b y) = \big(x\, \varphi'(w^T x - b),\, -\varphi'(w^T x - b)\big) = \varphi'(w^T x - b)\,(x, -1). \]

This states that the gradient of the output, $\nabla y$, is proportional to the extended
input vector $(x, -1)$, the proportionality factor being the derivative of $\varphi$.
This will be useful in the back-propagation algorithm later.
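The gradient formula can be compared against finite differences. This is a minimal sketch for the logistic activation; the helper names `grad_output` and `numeric_grad` are illustrative and not taken from the text.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x, b):
    """y = sigma(w^T x - b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - b)

def grad_output(w, x, b):
    """Analytic gradient (grad_w y, dy/db) = sigma'(w^T x - b) * (x, -1)."""
    s = output(w, x, b)
    ds = s * (1.0 - s)              # derivative of the logistic function
    return [ds * xi for xi in x], -ds

def numeric_grad(w, x, b, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    gw = []
    for i in range(len(w)):
        wp = list(w); wp[i] += h
        wm = list(w); wm[i] -= h
        gw.append((output(wp, x, b) - output(wm, x, b)) / (2 * h))
    gb = (output(w, x, b + h) - output(w, x, b - h)) / (2 * h)
    return gw, gb

print(grad_output([0.3, -0.7], [1.2, 0.4], 0.1))
print(numeric_grad([0.3, -0.7], [1.2, 0.4], 0.1))
```

Both printed pairs should agree to several digits, which illustrates the proportionality to the extended input vector $(x, -1)$.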

5.4 Logistic Regression 

In this section we shall present two applications of the sigmoid neuron, which
learns using logistic regression. The first example deals with forecasting the
probability of default of a company, and the second one with a binary
classification into clusters.

5.4.1 Default probability of a company 

We are interested in predicting the probability of default of a company for a
given time period $[0,T]$, using a sigmoid neuron. Assume the input $x$ of the
neuron is a vector which contains some information regarding the company,
such as: cash reserves, revenue, costs, labor expenses, etc. The training set
consists of $n$ pairs $(x_i, z_i)$, with $x_i$ inputs as before and $z_i \in \{-1, 1\}$. A
value $z_i = 1$ means the $i$th company has defaulted during $[0,T]$; a value
$z_i = -1$ means that the $i$th company has not defaulted during $[0,T]$. The
measurements $(x_i, z_i)_{1 \le i \le n}$ represent empirically a pair of random variables
$(X, Z)$, where $Z$ takes values $\pm 1$ and $X$ is an $m$-dimensional random vector.
The conditional probability $P(Z \mid X)$ is given by the table

    $z$               $-1$            $1$
    $P(z \mid x)$     $1 - h(x)$      $h(x)$

for some positive function $h : \mathbb{R}^m \to [0,1]$, where $m$ is the dimension of the
input $X$. One convenient way to choose $h(x)$ is using the sigmoid function

\[ h(x) = \sigma(w^T x). \]

The inner product $w^T x$ is a weighted sum of the inputs and can be interpreted
as a “default score” of the company with the input $x$. A high (positive) score
$w^T x$ implies a value of $\sigma(w^T x)$ close to 1, which can be interpreted as a high
probability of default, provided that $\sigma(w^T x)$ is considered as a probability.
On the other hand, a low (negative) score $w^T x$ implies a value of $\sigma(w^T x)$ close
to 0, which can be interpreted as a low probability of default. These two cases
can be encapsulated in only one formula by writing the probability $P(z \mid x)$ as
$\sigma(z\, w^T x)$, where $z \in \{-1, 1\}$. This can be shown as follows. When
$z = 1$, we recover the above formula, $h(x) = \sigma(w^T x)$, and when $z = -1$,
using the property of the sigmoid, $\sigma(-x) = 1 - \sigma(x)$, we have

\[ \sigma(z\, w^T x) = \sigma(-w^T x) = 1 - \sigma(w^T x) = 1 - h(x). \]

Hence, the previous table can be written equivalently as

    $z$               $-1$                $1$
    $P(z \mid x)$     $\sigma(-w^T x)$    $\sigma(w^T x)$

The probability distribution depends now on the parameter $w$, and we shall
consider it as a “model distribution”, which is produced by the neuron. The
next step is to find the weight $w$ for which the model distribution approximates
in “the best way” the training data $\{(x_i, z_i)\}$, $1 \le i \le n$. This shall be
accomplished by choosing $w$ such that the likelihood of the observed labels $z_i$,
given the inputs $x_i$, is maximized. Assuming the training measurements are independent,
we have

\[
\begin{aligned}
w^* &= \arg\max_w P(z_1, \dots, z_n \mid x_1, \dots, x_n)
     = \arg\max_w \prod_{i=1}^n P(z_i \mid x_i)
     = \arg\max_w \ln\Big( \prod_{i=1}^n P(z_i \mid x_i) \Big) \\
    &= \arg\max_w \sum_{i=1}^n \ln P(z_i \mid x_i)
     = \arg\min_w \Big( -\sum_{i=1}^n \ln P(z_i \mid x_i) \Big) \\
    &= \arg\min_w \Big( -\frac{1}{n}\sum_{i=1}^n \ln P(z_i \mid x_i) \Big)
     = \arg\min_w E_{p_X}\big[ -\ln P(Z \mid X) \big],
\end{aligned}
\]


where we have used the properties of the logarithmic function and the fact that a
factor of $\frac{1}{n}$ does not affect the optimum, as well as the fact that a switch in the
sign swaps the max into a min. The final expression, involving the expectation
with respect to $p_X$, is the cross-entropy of the empirical probability $p_X$ with
the conditional probability $P(Z \mid X)$. This expression of the cross-entropy can
be computed explicitly using the previous formula for the probability and
the expression (5.3.1) for the sigmoid function as follows:

\[
\begin{aligned}
E_{p_X}\big[ -\ln P(Z \mid X) \big]
  &= -\frac{1}{n}\sum_{i=1}^n \ln P(z_i \mid x_i)
   = -\frac{1}{n}\sum_{i=1}^n \ln \sigma(z_i w^T x_i) \\
  &= -\frac{1}{n}\sum_{i=1}^n \ln \frac{1}{1 + e^{-z_i w^T x_i}}
   = \frac{1}{n}\sum_{i=1}^n \ln\big( 1 + e^{-z_i w^T x_i} \big).
\end{aligned}
\]




















The optimal weight is then given by

\[ w^* = \arg\min_w \Big( \frac{1}{n}\sum_{i=1}^n \ln\big( 1 + e^{-z_i w^T x_i} \big) \Big). \]

This can be written using the softplus function as

\[ w^* = \arg\min_w \Big( \frac{1}{n}\sum_{i=1}^n sp(-z_i w^T x_i) \Big), \]

or, equivalently, as

\[ w^* = \arg\min_w \Big( \frac{1}{n}\sum_{i=1}^n sp(z_i w^T x_i) - \frac{1}{n}\sum_{i=1}^n z_i w^T x_i \Big), \]

where we used that $sp(-x) = sp(x) - x$.
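The two softplus forms of the cross-entropy can be compared numerically. The sketch below assumes the usual definition $sp(t) = \ln(1 + e^t)$ and uses a small hypothetical data set; the function names are illustrative.

```python
import math

def softplus(t: float) -> float:
    """sp(t) = ln(1 + e^t), written so that exp() never overflows."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def cross_entropy(w, xs, zs):
    """(1/n) sum_i sp(-z_i w^T x_i), with labels z_i in {-1, +1}."""
    return sum(softplus(-z * dot(w, x)) for x, z in zip(xs, zs)) / len(xs)

def cross_entropy_alt(w, xs, zs):
    """Equivalent form (1/n) sum_i sp(z_i w^T x_i) - (1/n) sum_i z_i w^T x_i."""
    return sum(softplus(z * dot(w, x)) - z * dot(w, x) for x, z in zip(xs, zs)) / len(xs)

# Hypothetical inputs, labels and weight, for illustration only.
xs = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3]]
zs = [1, -1, 1]
w = [0.4, -0.6]
print(cross_entropy(w, xs, zs), cross_entropy_alt(w, xs, zs))
```

For $w = 0$ every term equals $\ln 2$, which gives a convenient reference value for the error before training.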


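Gradient Descent. There is no closed-form solution in this case for the
optimum weight $w^*$. We shall approximate the value $w^*$ by a sequence
using the gradient descent method. The gradient of the previous cross-entropy
error

\[ F(w) = \frac{1}{n}\sum_{i=1}^n \ln\big( 1 + e^{-z_i w^T x_i} \big) \]

is given by

\[
\begin{aligned}
\nabla_w F &= -\frac{1}{n}\sum_{i=1}^n \frac{z_i x_i\, e^{-z_i w^T x_i}}{1 + e^{-z_i w^T x_i}}
            = -\frac{1}{n}\sum_{i=1}^n \frac{z_i x_i}{1 + e^{z_i w^T x_i}}
            = -\frac{1}{n}\sum_{i=1}^n z_i x_i\, \sigma(-z_i w^T x_i) \\
           &= \frac{1}{n}\sum_{i=1}^n z_i x_i\, \sigma(z_i w^T x_i) - \frac{1}{n}\sum_{i=1}^n z_i x_i,
\end{aligned}
\]

where the last equality uses $\sigma(-u) = 1 - \sigma(u)$.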


Note that the last term is the empirical expectation of the product of random
variables $X$ and $Z$, i.e., $\frac{1}{n}\sum_{i=1}^n x_i z_i = E[XZ]$. The first term is a weighted
sum of the entries $(x_i, z_i)$ with the weights $\rho_i = \sigma(z_i w^T x_i) \in (0,1)$.

To conclude, given an initial weight $w^{(0)}$, the approximation
sequence $(w^{(j)})_{j \ge 0}$ that approaches the minimum argument, $w^*$, of the error function
$F(w)$ is given by

\[ w^{(j+1)} = w^{(j)} - \eta\, \nabla_w F(w^{(j)}) = w^{(j)} + \frac{\eta}{n}\sum_{i=1}^n z_i x_i\, \sigma\big(-z_i\, w^{(j)T} x_i\big), \]

where $\eta > 0$ is the learning rate.

Figure 5.9: The family of lines $w_1 x_1 + w_2 x_2 - b = c$, $c \in \mathbb{R}$. The sigmoid
function $\sigma$ maps each line to a point in $(0,1)$ regarded as a probability.

How do we apply this technique in practice? Assume we have the training
set $\{(x_i, z_i)\}_{1 \le i \le n}$ and consider a given input $x$ describing the parameters of a
certain company for which we would like to evaluate the probability of default. Then
the desired probability is given by $h(x) = \sigma(w^{*T} x)$, where $w^* = \lim_{j \to \infty} w^{(j)}$.
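The gradient descent procedure for the default probability can be sketched as follows. The data set, the learning rate and the number of steps are hypothetical choices made only for illustration; the first feature is a constant 1 so that the score $w^T x$ contains an offset, which is an assumption of this sketch.

```python
import math

def sigmoid(t: float) -> float:
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def loss(w, xs, zs):
    """Cross-entropy error F(w) = (1/n) sum_i ln(1 + exp(-z_i w^T x_i))."""
    total = 0.0
    for x, z in zip(xs, zs):
        t = -z * dot(w, x)
        total += max(t, 0.0) + math.log1p(math.exp(-abs(t)))
    return total / len(xs)

def grad_loss(w, xs, zs):
    """grad F(w) = -(1/n) sum_i z_i x_i sigma(-z_i w^T x_i)."""
    g = [0.0] * len(w)
    for x, z in zip(xs, zs):
        s = sigmoid(-z * dot(w, x))
        for k in range(len(w)):
            g[k] -= z * x[k] * s / len(xs)
    return g

def fit(xs, zs, eta=0.5, steps=2000):
    """Gradient descent w <- w - eta * grad F(w), starting from w = 0."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = grad_loss(w, xs, zs)
        w = [wk - eta * gk for wk, gk in zip(w, g)]
    return w

def default_probability(w, x):
    """h(x) = sigma(w^T x)."""
    return sigmoid(dot(w, x))

# Hypothetical training set: (constant 1, one score-like feature), label +1 = default.
xs = [[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0], [1.0, 0.2], [1.0, -0.3]]
zs = [1, 1, -1, -1, -1, 1]
w_fit = fit(xs, zs)
print(w_fit, default_probability(w_fit, [1.0, 1.0]))
```

In practice the iteration is stopped after finitely many steps, so `w_fit` is only an approximation of $w^*$.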


5.4.2 Binary Classifier 


This section deals with the sigmoid neuron as a binary classifier. This means
that the sigmoid neuron is trained to distinguish between two distinct clusters
in the plane, the learning algorithm being the logistic regression.

Assume we have two groups of points in the plane: black and white. We
would like to split the points in two groups: the black points cluster, denoted by
$\mathcal{G}_1$, and the white points cluster, denoted by $\mathcal{G}_2$. If $x = (x_1, x_2)$ represents the
coordinates of a point in the plane, consider the target given by the following
decision function:

\[ z(x_1, x_2) = \begin{cases} 1, & \text{if } x \in \mathcal{G}_1 \\ -1, & \text{if } x \in \mathcal{G}_2, \end{cases} \]

i.e., each black point is associated the label 1, while each white point the label
$-1$. Assume there is a decision line in the plane, given by $w_1 x_1 + w_2 x_2 - b = 0$,
which attempts to split the points into clusters $\mathcal{G}_1$ and $\mathcal{G}_2$; say $\mathcal{G}_1$ above the
decision line and $\mathcal{G}_2$ below it, see Fig. 5.9. The job is to adjust parameters
$w_1$, $w_2$, and $b$ such that the aforementioned line represents “the best” split of
data into clusters of identical color.










Consider the partition of the plane into lines parallel to the decision line

\[ \{ w_1 x_1 + w_2 x_2 - b = c;\ c \in \mathbb{R} \}, \]

see Fig. 5.9. The lines with a positive value of $c$ correspond to the black region,
$\mathcal{G}_1$, while the lines with a negative value of $c$ correspond to the white region,
$\mathcal{G}_2$. The line corresponding to $c = 0$ is the decision line between the two
clusters of distinct colors. We can think of the plane as a gray scale from pure
white, when $c = -\infty$, to pure black, for $c = +\infty$. Also the 50% black-white
mixture is realized for $c = 0$.

Each point is more or less the same color as its neighboring background.
If a given point has a color that is in large discrepancy with the background,
then the “surprise” element is large, and this corresponds to a large amount
of information. If the point color and the background do not differ much
in tone, the amount of information provided by this point is small. We shall
construct an information measure based on this color difference effect, subject
to be minimized later. The information associated to a point of a given color
with coordinate $x = (x_1, x_2)$ is defined as

\[ H(x) = -\ln \sigma\big( z(x)(w^T x - b) \big), \]

where $w = (w_1, w_2)$. We shall explain this construction for each of the
following cases: misclassified points and correctly classified points.


The case of misclassified points. Now, we pick a misclassified point of
coordinate $x$, i.e. a point of a different color than its neighboring background. For
instance, we pick the point with label “1” in Fig. 5.9. This is a white point in
a very black region, so the information associated with this event is large. We
represented this in the figure by considering a larger radius, so the
information is associated with the area of the white disk. There is a positive constant
$c > 0$ such that this point belongs to the line $w_1 x_1 + w_2 x_2 - b = c$. The
information measure is constructed now as follows. The sigmoid function
$\sigma$ maps the constant $c$ into a value between 0 and 1, which can be considered
as a probability, $P(x) = \sigma(c)$. The information associated with the point $x$
is given by the negative log-likelihood function, $-\ln P(x) = -\ln \sigma(c)$. If $c$ is
large, then $\sigma(c)$ is close to 1 and hence the information $-\ln P(x)$ is close to 0,
which does not make sense, since when $c$ is large, the information should be
also large (as a white point in a very black region). To fix this problem, we shall
define the information slightly differently, using the complementary
probability. Since the point is white, $x \in \mathcal{G}_2$, then $z(x) = -1$. Define the probability
as $P(x) = \sigma(z(x)c) = \sigma(-c) = 1 - \sigma(c)$, so when $c$ is large, the probability is
close to 0 and hence the information $-\ln P(x) = -\ln \sigma(z(x)c) = -\ln \sigma(-c)$
is also large.




The same approach applies to the black point with the label “2” lying in
the white region, see Fig. 5.9. In this case $c < 0$ and, since the point is black,
$x \in \mathcal{G}_1$, we have $z(x) = 1$. The information is given again by $-\ln \sigma(z(x)c)$,
and for $c$ very negative the value of the information is large, as expected.

The case of correctly classified points. We choose a correctly classified point,
for instance, the white point with label “3” situated in the white area, see
Fig. 5.9. Let $x$ denote its coordinate and $c < 0$ be the constant such that
$w^T x - b = c$. The information associated with this point is small if the
difference in the tone color with respect to the background is small. This
occurs when the constant $c$ is large and negative. We associate a probability
with $x$, defined by $P(x) = \sigma(z(x)c)$. In this case $z(x) = -1$, since the point
is white, $x \in \mathcal{G}_2$. Then, for $c$ large and negative, the value of $\sigma(z(x)c)$ is close
to 1, and hence its logarithm is close to 0. Therefore, it makes sense to define
the information of this point as $-\ln \sigma(z(x)c)$.

Now we are in the situation of being able to find the best fitting line using
an information minimization error. We shall see that this is equivalent to the
maximum likelihood method. Consider $n$ points in the plane with coordinates
$x_i$, and color type $z_i = z(x_i)$, $1 \le i \le n$. The total information is defined as
the sum of all individual points information (both correct and misclassified).
This is

\[ E(x, z) = \sum_{i=1}^n H(x_i) = -\sum_{i=1}^n \ln \sigma\big( z_i (w^T x_i - b) \big). \]

The “best decision line” is the line corresponding to the values of $w$ and $b$
that minimize the aforementioned information. The optimal values are given
by

\[
\begin{aligned}
(w^*, b^*) &= \arg\min_{w,b} E(x, z)
            = \arg\max_{w,b} \sum_{i=1}^n \ln \sigma\big( z_i (w^T x_i - b) \big) \\
           &= \arg\max_{w,b} \ln\Big( \prod_{i=1}^n \sigma\big( z_i (w^T x_i - b) \big) \Big)
            = \arg\max_{w,b} \prod_{i=1}^n \sigma\big( z_i (w^T x_i - b) \big) \\
           &= \arg\max_{w,b} \prod_{i=1}^n P_{w,b}(z_i \mid x_i),
\end{aligned}
\]

where $P_{w,b}(z \mid x)$ represents the probability that the color is of type $z$ (black
or white), given that the point coordinate is $x$. We infer that the
optimal parameters $(w^*, b^*)$, which define the decision line $w^{*T} x - b^* = 0$,
are obtained using a maximum likelihood method.





It is worth noting that the aforementioned error function, $E(x, z)$, can be
also regarded as a cross-entropy of the empirical probability of $X$ with the
conditional probability of $Z$, given $X$, as

\[ E(x, z) = E_{p_X}\big[ -\ln P(Z \mid X) \big], \]

with the model probability given by $-\ln P(z \mid X = x) = -\ln \sigma\big( z (w^T x - b) \big)$.

Gradient descent. In the absence of closed-form formulae for $w^*$ and $b^*$,
we shall employ the gradient descent method to obtain some appropriate
approximation. The gradient has two components

\[ \nabla E = (\nabla_w E, \partial_b E). \]

Writing the error as

\[ E(x, z) = -\sum_{i=1}^n \ln \frac{1}{1 + e^{-z_i (w^T x_i - b)}} = \sum_{i=1}^n \ln\big( 1 + e^{-z_i (w^T x_i - b)} \big), \]

and differentiating, we obtain

\[ \nabla_w E = -\sum_{i=1}^n \frac{z_i x_i}{1 + e^{z_i w^T x_i - z_i b}}, \qquad
   \partial_b E = \sum_{i=1}^n \frac{z_i}{1 + e^{z_i w^T x_i - z_i b}}. \]

The approximation sequence is defined recursively by

\[
\begin{aligned}
w^{(j+1)} &= w^{(j)} - \eta\, \nabla_w E(w^{(j)}, b^{(j)}) \\
b^{(j+1)} &= b^{(j)} - \eta\, \partial_b E(w^{(j)}, b^{(j)}),
\end{aligned}
\]

where $\eta > 0$ is the learning rate.
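The recursion for the decision line can be sketched in a few lines. The black and white points, the learning rate and the number of steps below are hypothetical and serve only as an illustration of the update rule for $(w, b)$.

```python
import math

def sigmoid(t: float) -> float:
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def error(w, b, pts, labels):
    """Total information E = sum_i ln(1 + exp(-z_i (w^T x_i - b)))."""
    total = 0.0
    for x, z in zip(pts, labels):
        t = -z * (w[0] * x[0] + w[1] * x[1] - b)
        total += max(t, 0.0) + math.log1p(math.exp(-abs(t)))
    return total

def gradients(w, b, pts, labels):
    """Return (grad_w E, dE/db)."""
    gw, gb = [0.0, 0.0], 0.0
    for x, z in zip(pts, labels):
        # s = 1 / (1 + exp(z_i (w^T x_i - b)))
        s = sigmoid(-z * (w[0] * x[0] + w[1] * x[1] - b))
        gw[0] -= z * x[0] * s
        gw[1] -= z * x[1] * s
        gb += z * s
    return gw, gb

def train(pts, labels, eta=0.05, steps=2000):
    """Gradient descent on (w, b), starting from w = 0, b = 0."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        gw, gb = gradients(w, b, pts, labels)
        w = [w[0] - eta * gw[0], w[1] - eta * gw[1]]
        b = b - eta * gb
    return w, b

def classify(w, b, x):
    """Label +1 (black) above the decision line, -1 (white) otherwise."""
    return 1 if w[0] * x[0] + w[1] * x[1] - b > 0 else -1

# Hypothetical clusters: three black points (label 1), three white points (label -1).
black = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0)]
white = [(0.0, 0.0), (-1.0, 1.0), (0.5, -1.0)]
pts = black + white
labels = [1, 1, 1, -1, -1, -1]
w_fit, b_fit = train(pts, labels)
print(w_fit, b_fit)
```

For linearly separable clusters such as these the error has no minimizer (it only tends to 0 as the weights grow), so the number of steps acts as a stopping rule.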

Remark 5.4.1 There are also other error functions one can consider. For
instance, one simple error function is the number of misclassified points. If
$m_1$ and $m_{-1}$ represent the number of misclassified black and white points,
respectively, then the error becomes $E = m_1 + m_{-1}$. The flaw of this error
function is its discontinuity with respect to parameters $w$ and $b$, and hence
no differentiable method would work for it.


5.4.3 Learning with the square difference cost function 

Consider an abstract neuron as in Fig. 5.1. Its inputs are $x^T = (x_1, \dots, x_n)$,
the output is $y = \sigma(w^T x - b)$, with weights $w^T = (w_1, \dots, w_n)$ and bias $b$.
We shall assume the cost function is given by

\[ C(w, b) = \frac{1}{2}(y - z)^2, \]

where the target, $z$, is a number in $(0,1)$. The minimum of $C(w, b)$ is equal to
zero and is achieved for $y = z$. This occurs when the weights and bias satisfy
the linear equation $w^T x - b = \sigma^{-1}(z)$. Since this equation has a hyperplane
of solutions, $(w_1, \dots, w_n, b)$, the minimum of the cost function is not unique.
Therefore, learning a real number, $z \in (0,1)$, can be done exactly in multiple
ways. This is no longer true for the case when the neuron learns a random
variable. We shall deal with this case next.

Assume the input is an $n$-dimensional random vector $X = (X_1, \dots, X_n)$
and the target $Z$ a one-dimensional random variable taking values in $(0,1)$.
Consider $m$ measurements of the variables $(X, Z)$ given by the training set

\[ \{ (x^{(1)}, z^{(1)}), \dots, (x^{(m)}, z^{(m)}) \}, \]

with $x^{(j)} \in \mathbb{R}^n$ and $z^{(j)} \in \mathbb{R}$, $1 \le j \le m$.
The cost function is given by the empirical mean

\[ C(w, b) = \frac{1}{2} E\big[ (Y - Z)^2 \big] = \frac{1}{2m}\sum_{j=1}^m \big( \sigma(w^T x^{(j)} - b) - z^{(j)} \big)^2. \]

If an exact learning would hold, i.e., if $C = 0$, then $w^T x^{(j)} - b = \sigma^{-1}(z^{(j)})$,
$1 \le j \le m$. Therefore, $(w, b)$ satisfies a linear system with $m$ equations and
$n + 1$ unknowns. Since in practice $m$ is much larger than $n$ (i.e., the number
of observations is a lot larger than the input dimension), the previous linear
system is in general incompatible, so exact learning does not hold.

In this case the minimum of the cost function is computed using the
gradient descent method. For this it suffices to compute the partial derivatives
with respect to weights and bias:

\[
\begin{aligned}
\nabla_w C &= \frac{1}{m}\sum_{j=1}^m \big( \sigma(w^T x^{(j)} - b) - z^{(j)} \big)\, \sigma'(w^T x^{(j)} - b)\, x^{(j)} \\
\frac{\partial C}{\partial b} &= -\frac{1}{m}\sum_{j=1}^m \big( \sigma(w^T x^{(j)} - b) - z^{(j)} \big)\, \sigma'(w^T x^{(j)} - b).
\end{aligned}
\]

The approximation sequence for the optimal values of parameters is given by

\[ (w^{(k+1)}, b^{(k+1)}) = (w^{(k)}, b^{(k)}) - \eta\, (\nabla_w C, \partial_b C), \]

with $\eta > 0$ the learning rate.
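A minimal sketch of this procedure is given below. It assumes the cost carries the factor $\frac{1}{2m}$, so that the partial derivatives carry the factor $\frac{1}{m}$; the one-dimensional training set is hypothetical and is generated by a known neuron, so that the cost can actually approach zero.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def cost(w, b, xs, zs):
    """C(w, b) = (1/(2m)) sum_j (sigma(w^T x_j - b) - z_j)^2."""
    return sum((sigmoid(dot(w, x) - b) - z) ** 2 for x, z in zip(xs, zs)) / (2 * len(xs))

def grad_cost(w, b, xs, zs):
    """Return (grad_w C, dC/db)."""
    m = len(xs)
    gw, gb = [0.0] * len(w), 0.0
    for x, z in zip(xs, zs):
        y = sigmoid(dot(w, x) - b)
        r = (y - z) * y * (1.0 - y)     # (sigma - z) * sigma'
        for k in range(len(w)):
            gw[k] += r * x[k] / m
        gb -= r / m
    return gw, gb

def train(xs, zs, eta=0.5, steps=5000):
    """Gradient descent on (w, b), starting from zero parameters."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        gw, gb = grad_cost(w, b, xs, zs)
        w = [wk - eta * gk for wk, gk in zip(w, gw)]
        b -= eta * gb
    return w, b

# Hypothetical data: targets in (0, 1) produced by a neuron with w = 1.5, b = 0.5.
xs = [[-1.0], [0.0], [1.0], [2.0]]
zs = [sigmoid(1.5 * x[0] - 0.5) for x in xs]
w_fit, b_fit = train(xs, zs)
print(w_fit, b_fit, cost(w_fit, b_fit, xs, zs))
```

With noisy targets the cost would stay strictly positive, in agreement with the remark that exact learning does not hold when $m$ is much larger than $n$.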

5.5 Linear Neuron 

The linear neuron is a neuron with a linear activation function, $\varphi(x) = x$,
and $n$ random inputs. Its learning algorithm uses the least mean squares cost
function. It was actually implemented as a physical device by Widrow and
Hoff [125]. It can be used for pattern recognition, data filtering and, of course,
for approximating linear functions.

Figure 5.10: Linear neuron with bias $b$ and linear activation function.

The neuron has $n$ inputs, which are random variables, $X_1, \dots, X_n$. The
weight for the input $X_j$ is a number denoted by $w_j$. The bias is considered
to be the weight $w_0 = b$, see Fig. 5.10. Consider $X_0 = -1$, constant. We shall
adopt the vectorial notation

\[ X = \begin{pmatrix} X_0 \\ X_1 \\ \vdots \\ X_n \end{pmatrix}, \qquad
   w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{pmatrix}. \]

Given that the activation function is the identity, the neuron output is given
by the one-dimensional random variable

\[ Y = \sum_{j=0}^n w_j X_j = w^T X = X^T w. \]

The desired output, or the target, is given by a one-dimensional random
variable $Z$. The idea is to tune the parameter vector $w$ such that $Y$ learns
$Z$; the learning algorithm in this case is to minimize the expectation of the
squared difference, so the optimal parameter is

\[ w^* = \arg\min_w E\big[ (Z - Y)^2 \big]. \]

We need to find $w^*$, and we shall do this in a few ways. Some of these methods
have more theoretical value, while others are more practical.

Exact solution. A computation shows that the previous error function is
quadratic in $w$:

\[
\begin{aligned}
E\big[ (Z - Y)^2 \big] &= E\big[ Z^2 - 2ZY + Y^2 \big] = E\big[ Z^2 - 2Z X^T w + (w^T X)(X^T w) \big] \\
  &= E[Z^2] - 2E[Z X^T]\, w + w^T E[X X^T]\, w \\
  &= c - 2b^T w + w^T A w.
\end{aligned}
\]














The coefficients have the following meaning: $c = E[Z^2]$ is the second
moment of the target $Z$, $b = E[ZX]$ is a vector measuring the
cross-correlation between the input and the output, and

\[ A = E[X X^T] = \begin{pmatrix}
E[X_0 X_0] & E[X_0 X_1] & \cdots & E[X_0 X_n] \\
E[X_1 X_0] & E[X_1 X_1] & \cdots & E[X_1 X_n] \\
\vdots & \vdots & & \vdots \\
E[X_n X_0] & E[X_n X_1] & \cdots & E[X_n X_n]
\end{pmatrix} \]

is a matrix describing the autocorrelation of inputs. In the following analysis
we shall assume that $A$ is nondegenerate and positive definite (i.e., the inputs
are coherent).

Denoting the aforementioned real-valued quadratic error function by

\[ \xi(w) = c - 2b^T w + w^T A w, \]

its gradient is given by $\nabla_w \xi(w) = 2Aw - 2b$, see Example 4.1.3. The optimal
weight $w^*$ is obtained as a solution of $\nabla_w \xi(w) = 0$, which becomes the linear
system $Aw = b$. Since $A$ is nondegenerate, the system has the unique solution
$w^* = A^{-1} b$. This is a minimum point, since the Hessian of the error is given
by $H_\xi(w) = 2A$, with $A$ positive definite.
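The exact solution can be illustrated on a tiny data set by replacing the expectations with sample means. The sketch below is an assumption-laden illustration: the samples are hypothetical, each input already contains the constant component $x_0 = -1$, and the linear system is solved by a hand-written Gaussian elimination rather than by an explicit inverse.

```python
def empirical_moments(xs, zs):
    """Sample estimates of A = E[X X^T] and b = E[Z X]."""
    m, d = len(xs), len(xs[0])
    A = [[sum(x[i] * x[j] for x in xs) / m for j in range(d)] for i in range(d)]
    b = [sum(z * x[i] for x, z in zip(xs, zs)) / m for i in range(d)]
    return A, b

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

# Hypothetical samples of (X, Z) with X = (X_0, X_1), X_0 = -1 and Z = 2 X_1 - 1.
ts = [0.0, 1.0, 2.0, 3.0]
xs = [[-1.0, t] for t in ts]
zs = [2.0 * t - 1.0 for t in ts]
A, b = empirical_moments(xs, zs)
w_star = solve(A, b)
print(A, b, w_star)
```

Here the target is an exact linear function of the input, so the solution reproduces its coefficients: $w_0 = 1$ (the bias, multiplied by $X_0 = -1$) and $w_1 = 2$.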


Gradient descent method. In real life, when $n$ is very large, it is
computationally expensive to find the inverse $A^{-1}$. Hence, the need of a faster method
to produce the optimal weight, $w^*$, even if only as an approximation. This
can be seen as a trade-off between the solution accuracy and the computer
time spent to find the optimum.

In this case, the gradient descent method is more practical. The error
function $\xi(w)$ is convex and has a minimum, see Fig. 5.11. We start from an
arbitrary initial weight vector $w^{(0)} = (w_0^{(0)}, \dots, w_n^{(0)})^T \in \mathbb{R}^n$. Then construct
the approximation sequence $(w^{(j)})_j$ defined recursively by equation (4.2.4)

\[
\begin{aligned}
w^{(j+1)} &= w^{(j)} - \eta\, \nabla_w \xi(w^{(j)}) \\
          &= w^{(j)} - 2\eta\, (A w^{(j)} - b) \\
          &= (I_n - 2\eta A)\, w^{(j)} + 2\eta b \\
          &= M w^{(j)} + 2\eta b,
\end{aligned}
\]

where $M = I_n - 2\eta A$, with $I_n$ denoting the $n$-dimensional identity matrix.
Iterating, we obtain

\[
\begin{aligned}
w^{(j)} &= M^j w^{(0)} + (M^{j-1} + M^{j-2} + \cdots + M + I_n)\, 2\eta b \\
        &= M^j w^{(0)} + (I_n - M^j)(I_n - M)^{-1}\, 2\eta b \\
        &= M^j w^{(0)} + (I_n - M^j) A^{-1} b.
\end{aligned}
\tag{5.5.3}
\]






Figure 5.11: The quadratic error function has a global minimum $w^*$.

Assume the learning rate $\eta > 0$ is chosen such that $\lim_{j \to \infty} M^j = O_n$. Then
taking the limit in the previous formula yields $w^* = \lim_{j \to \infty} w^{(j)} = A^{-1} b$,
which recovers the previous result.

Now we return to the assumption $\lim_{j \to \infty} M^j = O_n$. Since $M$ is a
symmetric matrix, its eigenvalues are all real. Similarly with the method applied
in Example 4.2.7, this assumption is equivalent to showing that the
eigenvalues $\{\lambda_i\}$ of $M$ are bounded, with $|\lambda_i| < 1$. We do this in two steps:

Step 1. Show that $\lambda_i < 1$.
Let $\lambda_i$ denote an eigenvalue of $M$. Then $\det(M - \lambda_i I_n) = 0$. Substituting for
$M$, this becomes $\det\big( A - \frac{1 - \lambda_i}{2\eta} I_n \big) = 0$. This implies that $\alpha_i = \frac{1 - \lambda_i}{2\eta}$ is
an eigenvalue of matrix $A$. Since $A$ is positive definite and nondegenerate, it
follows that $\alpha_i > 0$, which implies that $\lambda_i < 1$.

Step 2. Show that $\lambda_i > 0$ for $\eta$ small enough.
The condition $\lambda_i > 0$ is equivalent to $\frac{1 - \lambda_i}{\alpha_i} < \frac{1}{\alpha_i}$, where we used that $A$ has
positive eigenvalues. This can be written in terms of $\eta$ as $2\eta < \frac{1}{\alpha_i}$. Hence,
the learning rate has to be chosen such that

\[ 0 < \eta < \min_i \frac{1}{2\alpha_i}. \tag{5.5.4} \]


The closed-form expression (5.5.3) is not of much practical use, since it
contains the inverse $A^{-1}$. In real life we use the iterative formula

\[ w^{(j+1)} = M w^{(j)} + 2\eta b, \]

where the learning rate $\eta$ satisfies the inequality (5.5.4).
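The iterative formula and the bound (5.5.4) can be tried on a two-dimensional example. The matrix $A$ and the vector $b$ below are hypothetical (positive definite $A$, with $A^{-1} b = (1, 2)^T$), and the closed-form eigenvalue helper only works for symmetric $2 \times 2$ matrices.

```python
import math

def eig_sym_2x2(A):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return tr / 2.0 - disc, tr / 2.0 + disc

def iterate(A, b, eta, steps):
    """Iterate w <- M w + 2 eta b with M = I - 2 eta A, starting from w = 0."""
    w = [0.0, 0.0]
    for _ in range(steps):
        Aw = [A[0][0] * w[0] + A[0][1] * w[1], A[1][0] * w[0] + A[1][1] * w[1]]
        w = [w[k] - 2.0 * eta * (Aw[k] - b[k]) for k in range(2)]
    return w

# Hypothetical autocorrelation matrix and cross-correlation vector.
A = [[1.0, -1.5], [-1.5, 3.5]]
b = [-2.0, 5.5]

alpha_min, alpha_max = eig_sym_2x2(A)
eta_max = 1.0 / (2.0 * alpha_max)      # right-hand side of the bound (5.5.4)
eta = 0.8 * eta_max                    # any value in (0, eta_max) is admissible
w = iterate(A, b, eta, 2000)
print(eta_max, w)
```

The iteration never forms $A^{-1}$; each step costs one matrix-vector product, and the speed is governed by the eigenvalue of $M$ closest to 1.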













We shall estimate next the error at the $j$th iteration, $\epsilon_j = |w^{(j)} - w^*|$,
using equation (5.5.3). We have:

\[
\begin{aligned}
w^{(j)} - w^* &= M^j w^{(0)} + (I_n - M^j)\, w^* - w^* \\
              &= M^j (w^{(0)} - w^*).
\end{aligned}
\]

Using that $|Mx| \le \|M\|\, |x|$ for all $x \in \mathbb{R}^n$, we have

\[ \epsilon_j = |w^{(j)} - w^*| = |M^j (w^{(0)} - w^*)| \le \|M\|^j\, |w^{(0)} - w^*| = \mu^j d, \]

where $\mu = \|M\|$ is the norm of $M$ (considering $M$ as a linear operator),
and $d = |w^{(0)} - w^*|$ is the distance from the initial value of the weight
to the limit of the sequence. Since the norm of the matrix $M$ is its largest
eigenvalue, $\|M\| = \max_i \lambda_i$, and $\lambda_i < 1$, then $\mu \in (0,1)$, and hence $\mu^j d \to 0$
as $j \to \infty$.

Gradient estimates. For the sake of computer implementations, the error
function $\xi(w) = E[(Z - Y)^2]$ has to be estimated from measurements, $(x^{(j)}, z^{(j)})$,
where $x^{(j)} = (x_0^{(j)}, \dots, x_n^{(j)})$. The empirical error is

\[ \hat{\xi}(w) = \frac{1}{m}\sum_{j=1}^m \big( z^{(j)} - w^T x^{(j)} \big)^2 = \frac{1}{m}\sum_{j=1}^m \epsilon_j^2, \]

where $\epsilon_j = z^{(j)} - w^T x^{(j)}$ is the error between the desired value $z^{(j)}$ and the
output value $w^T x^{(j)}$. The number $m$ is regarded as the size of the mini-batch
used for computing the empirical error.

In order to save computation time, a crude estimation is done, i.e., $m = 1$,
which uses a single sample error, case in which the previous sum is replaced
by only one term, $\epsilon_j^2$. This use of a mini-batch of size 1 is called
online learning. The use of one training example at a time is prone to errors
in the gradient estimation, but it turns out to be fine as long as it keeps the
cost function decreasing. In this case the gradient is estimated as

\[ \hat{\nabla}_w \xi = \nabla_w \epsilon_j^2 = 2\epsilon_j \nabla_w \epsilon_j = 2\epsilon_j \nabla_w \big( z^{(j)} - w^T x^{(j)} \big) = -2\epsilon_j x^{(j)}. \]

Applying now the gradient descent method, we obtain

\[ w^{(j+1)} = w^{(j)} - \eta\, \hat{\nabla}_w \xi = w^{(j)} + 2\eta\, \epsilon_j x^{(j)}. \]

Substituting for $\epsilon_j = z^{(j)} - w^{(j)T} x^{(j)}$ yields

\[ w^{(j+1)} = w^{(j)} + 2\eta\, \big( z^{(j)} - w^{(j)T} x^{(j)} \big)\, x^{(j)}, \tag{5.5.5} \]

where $x^{(j)}$ is the $j$th measurement of the input $X$.
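The online update (5.5.5) takes only a few lines of code. In the sketch below the samples are hypothetical and noiseless (each input has the constant component $x_0 = -1$), and the samples are presented cyclically, which is one possible choice and not something prescribed by the text.

```python
def lms_step(w, x, z, eta):
    """One online update w <- w + 2 eta (z - w^T x) x, i.e. formula (5.5.5)."""
    err = z - sum(wk * xk for wk, xk in zip(w, x))
    return [wk + 2.0 * eta * err * xk for wk, xk in zip(w, x)]

def lms_train(samples, eta, passes):
    """Sweep repeatedly over the samples, using one sample per update."""
    w = [0.0] * len(samples[0][0])
    for _ in range(passes):
        for x, z in samples:
            w = lms_step(w, x, z, eta)
    return w

# Hypothetical measurements (x, z) with x = (x_0, x_1), x_0 = -1 and z = 2 x_1 - 1.
samples = [([-1.0, t], 2.0 * t - 1.0) for t in (0.0, 1.0, 2.0, 3.0)]
w = lms_train(samples, eta=0.05, passes=500)
print(w)
```

With noisy targets the iterates would keep fluctuating around the optimum instead of settling, which is the price of the single-sample gradient estimate.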














Figure 5.12: Adaline neuron.


Remark 5.5.1 If the inputs are random variables normally distributed, $X_j \sim
N(\mu_j, \sigma_j^2)$, $j = 1, \dots, n$, then the output signal, $Y = \sum_{j=1}^n w_j X_j + b$, is a
univariate Gaussian random variable, $Y \sim N(w^T \mu + b,\, w^T A w)$. This follows
from

\[
\begin{aligned}
E[Y] &= \sum_j w_j E[X_j] + b = w^T \mu + b \\
Var(Y) &= Var\Big( \sum_j w_j X_j \Big) = Cov\Big( \sum_j w_j X_j,\, \sum_i w_i X_i \Big) \\
       &= \sum_j \sum_i Cov(X_j, X_i)\, w_j w_i = w^T A w,
\end{aligned}
\]

where $A_{ij} = Cov(X_i, X_j)$ and $\mu^T = (\mu_1, \dots, \mu_n)$.


Remark 5.5.2 There is a case when the inverse matrix $A^{-1}$ is inexpensive
to compute. If the inputs $X_i$ are independent and have zero mean, $E[X_i] = 0$,
then the inverse matrix $A^{-1}$ is diagonal with $A^{-1}_{ii} = 1/E[X_i^2]$.


5.6 Adaline 

The Adaline (Adaptive Linear Neuron) is an early computing element with
binary inputs $\{\pm 1\}$ and signum activation function $S(x) = \mathrm{sign}(x)$ developed
by Widrow and Hoff in 1960, see Fig. 5.12. It was built out of hardware and
its weights were updated using electrically variable resistors. The difference
between the Adaline and the standard perceptron is that during the learning
phase the weights are adjusted according to the weighted sum of the inputs
(while in the case of the perceptron the weights are adjusted in terms of the
perceptron output).

The Adaline is trained by the $\alpha$-LMS algorithm or Widrow-Hoff delta
rule, see [127]. This is an example of an algorithm that applies the minimal
disturbance principle. If $w_k$ and $x_k$ denote, respectively, the weight and input
vector at step $k$, then the update rule is

\[ w_{k+1} = w_k + \alpha\, \frac{\epsilon_k x_k}{|x_k|^2}, \tag{5.6.6} \]

where $\alpha > 0$ and $\epsilon_k = z_k - w_k^T x_k$ denotes the error obtained as the difference
between the target $z_k$ and the linear output $s_k = w_k^T x_k$ before adaption. The
aforementioned update rule changes the error as follows:

\[
\begin{aligned}
\Delta \epsilon_k &= \Delta(z_k - w_k^T x_k) = \Delta(z_k - x_k^T w_k) = -x_k^T \Delta w_k \\
  &= -x_k^T (w_{k+1} - w_k) = -x_k^T\, \alpha\, \frac{\epsilon_k x_k}{|x_k|^2} = -\alpha\, \frac{\epsilon_k\, x_k^T x_k}{|x_k|^2} \\
  &= -\alpha\, \epsilon_k.
\end{aligned}
\]

This computation shows that the error gets reduced by the fraction $\alpha$, i.e., it
becomes $(1 - \alpha)\epsilon_k$, while the input vector $x_k$ is kept fixed. A practical range
for $\alpha$ is $0.1 < \alpha < 1$.
The initial weight is chosen to be $w_0 = 0$ and the training continues until
convergence.

It is worth noting that if all input vectors $x_k$ have equal length, then
the $\alpha$-LMS algorithm minimizes the mean-square error and the updating
rule (5.6.6) becomes the gradient descent rule, see Exercise 5.10.9.
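The effect of the update rule (5.6.6) on the error can be checked directly. The sketch below uses a hypothetical binary input vector, target and value of $\alpha$; the function name `alpha_lms_step` is illustrative.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def alpha_lms_step(w, x, z, alpha):
    """Update (5.6.6): w_{k+1} = w_k + alpha * eps_k * x_k / |x_k|^2."""
    eps = z - dot(w, x)                 # error before adaption
    norm2 = dot(x, x)
    return [wi + alpha * eps * xi / norm2 for wi, xi in zip(w, x)]

# Hypothetical step: binary inputs, zero initial weight, target z = 1.
w0 = [0.0, 0.0, 0.0]
x = [1.0, -1.0, 1.0]
z, alpha = 1.0, 0.3

w1 = alpha_lms_step(w0, x, z, alpha)
eps_before = z - dot(w0, x)
eps_after = z - dot(w1, x)
print(eps_before, eps_after)            # the error on this input shrinks to (1 - alpha) * eps_before
```

The normalization by $|x_k|^2$ is what makes the error reduction independent of the length of the input vector.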


5.7 Madaline 

Madaline (Multilayer adaptive linear element) was one of the first trainable
neural networks used in pattern recognition research, see Widrow [126] and
Hoff [80].

Madaline consists of a layer of Adaline neurons connected to a fixed logic
gate (AND, OR, etc.), which produces an output, see Fig. 5.13 a. Only the
weights associated with the Adaline units are adaptive, while the weights of
the output logic gate are fixed.

For instance, with suitably chosen weights, the Madaline with two Adaline
units and an AND output gate can implement the XNOR function given by

    $x_1$    $x_2$    XNOR
     1        1        1
     1       -1       -1
    -1       -1        1
    -1        1       -1

Figure 5.13: a. Madaline with two Adaline units. b. Madaline implementation
of XNOR function. The Madaline takes value 1 on the shaded area and value
$-1$ in rest.

Footnote (minimal disturbance principle): Adapt to reduce the output error for the
current training data, with minimal disturbance to responses already learned.


The separation boundaries are given by two lines, see Fig. 5.13 b and Exercise 
5.10.10. 
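One possible choice of weights can be written down explicitly. The weights and thresholds in the sketch below are an assumption made for illustration (they are not given in the text); they describe two parallel lines, and the AND gate selects the band between them.

```python
def sign(t: float) -> int:
    """Signum activation with the convention sign(0) = 1."""
    return 1 if t >= 0 else -1

def adaline(w, b, x) -> int:
    """Adaline unit with output sign(w^T x - b)."""
    return sign(w[0] * x[0] + w[1] * x[1] - b)

def and_gate(u: int, v: int) -> int:
    """Fixed AND gate on {-1, 1}-valued signals."""
    return 1 if (u == 1 and v == 1) else -1

def madaline_xnor(x) -> int:
    # Hypothetical weights: each unit is positive on one side of a line.
    a1 = adaline((1.0, -1.0), -1.5, x)     # positive where  x1 - x2 + 1.5 >= 0
    a2 = adaline((-1.0, 1.0), -1.5, x)     # positive where -x1 + x2 + 1.5 >= 0
    return and_gate(a1, a2)

for x in [(1, 1), (1, -1), (-1, -1), (-1, 1)]:
    print(x, madaline_xnor(x))
```

The two boundary lines are $x_1 - x_2 = \pm 1.5$, so the inputs with $x_1 = x_2$ fall inside the band and the other two fall outside.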

5.8 Continuum Input Neuron 

This section introduces the concept of continuum input neuron, which is a
neuron with an uncountably infinite number of weights. The concept will be
used later in sections 7.3 and 10.9 for the construction of continuum neural
networks. The learning in the context of continuum input neurons will be
approached in section 10.6.

It is interesting to notice the relation with the Wilson-Cowan model [129] 
of excitatory and inhibitory interactions of model neurons, which also uses a 
continuous distribution of neural thresholds in a population of neurons. 

A continuum input neuron has the inputs $x$ continuum over the interval
$[0,1]$. The weight associated with the value $x$ is given by $w(dx)$, where here
$w$ is a weighting measure on $[0,1]$. The first half of the computing unit will
integrate the inputs with respect to the measure $w$. The output function in
this case takes the form

\[ y = \sigma\Big( \int_0^1 x\, dw(x) \Big). \tag{5.8.7} \]


In particular, this is applicable to the situation when the input is a random 
variable X taking values in [0,1]. If w represents the distribution measure of 
X, then the neuron output depends on its expectation, y — cr(E[X]). 

Depending on the particular type of the weighting measure employed,
formula (5.8.7) provides a general scheme for representing a
neuron with either a finite number of inputs as previously discussed, a
countably infinite number of inputs, or an uncountably infinite number of inputs.
We shall provide next a few examples of neurons within the framework of
some particular measures.


Example 5.8.1 (Neuron with a Dirac measure) Let $x_0 \in [0,1]$ be fixed
and consider the Dirac measure $\delta_{x_0}$ sitting at $x_0$, defined by

\[ \delta_{x_0}(A) = \begin{cases} 1, & \text{if } x_0 \in A \\ 0, & \text{if } x_0 \notin A, \end{cases} \]

for any measurable set $A \in \mathcal{B}([0,1])$. This corresponds to the case when the
random variable $X$ takes only the value $x_0$. The output function in this case
is a constant:

\[ y = \sigma\Big( \int_0^1 x\, d\delta_{x_0}(x) \Big) = \sigma(x_0). \]

Since there are no parameters to adjust, there is nothing to learn here.

In order to avoid confusion, in the next examples we shall denote the
weighting measure by $\mu$ and the weights at the points by $w_i$.


Example 5.8.2 (Neuron with discrete measure) Let $E = \{x_1, \dots, x_n\}$
be a finite subset of $[0,1]$ and consider the discrete measure

\[ \mu(A) = \sum_{x_i \in E} w_i\, \delta_{x_i}(A), \qquad \forall A \in \mathcal{B}([0,1]), \]

where the positive number $w_i$ is the mass attached to the point $x_i$. The output
function is given by

\[ y = \sigma\Big( \int_0^1 x\, d\mu(x) \Big) = \sigma\Big( \sum_{i=1}^n w_i x_i \Big), \]

which corresponds to the classical neuron model, see section 5.3. The learning
algorithm in this case adjusts the weights $w_i$ to minimize an error function such as

\[ F(w) = \frac{1}{2}\Big( \sigma\Big( \sum_{i=1}^n w_i x_i \Big) - z \Big)^2, \]

$z$ being the desired value of the neuron, given the observation $x^T = (x_1, \dots, x_n)$.
In the case of $m$ observations $\{(x^k, z_k)\}_{k=1,\dots,m}$ the error function takes the
form

\[ F(w) = \frac{1}{2}\sum_{k=1}^m \Big( \sigma\Big( \sum_{i=1}^n w_i x_i^k \Big) - z_k \Big)^2. \]

The optimal weights will be found, for instance, using the gradient descent
algorithm.

Example 5.8.3 (Neuron with Lebesgue measure) Let $\mu$ be an absolutely
continuous measure with respect to the Lebesgue measure $dx$ on $[0,1]$. By the
Radon-Nikodym theorem there is a nonnegative measurable function $p$ on $[0,1]$ such
that $d\mu(x) = p(x)\,dx$. Therefore, the output function

\[ y = \sigma\Big( \int_0^1 x\, d\mu(x) \Big) = \sigma\Big( \int_0^1 x\, p(x)\, dx \Big) \]

depends on the weight function $p(x)$. The learning algorithm has to optimize
the neuron by choosing the optimal function $p(x)$, which minimizes the error
functional $F(p) = \frac{1}{2}\Big( \sigma\big( \int_0^1 x\, p(x)\, dx \big) - z \Big)^2$.

If $\mu$ is a probability measure, i.e., $\mu([0,1]) = 1$, then $p(x)$ is a density
function. The optimality problem implies that $p(x)$ is a critical value for
the Lagrange multiplier functional

\[ L(p) = \frac{1}{2}\Big( \sigma\Big( \int_0^1 x\, p(x)\, dx \Big) - z \Big)^2 - \lambda \Big( \int_0^1 p(x)\, dx - 1 \Big). \]


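In order to find the optimal density, $p$, we consider the variation given by the
family of density functions $(p_\epsilon)_{\epsilon > 0}$, with $p_\epsilon(x) = p(x) + \epsilon\varphi(x)$, where $\varphi$ is a
continuous function. We have $\frac{dp_\epsilon(x)}{d\epsilon} = \varphi(x)$, with $\int_0^1 \varphi(x)\,dx = 0$, since
each $p_\epsilon$ must integrate to 1. The fact that $p(x)$ is a minimum for the
functional $L(p)$ implies

$$\frac{dL(p_\epsilon(x))}{d\epsilon}\Big|_{\epsilon=0} = 0.$$

On the other side, an application of the chain rule provides

$$\frac{dL(p_\epsilon(x))}{d\epsilon} = \Big(\sigma\Big(\int_0^1 x\,p_\epsilon(x)\,dx\Big) - z\Big)\,\sigma'\Big(\int_0^1 x\,p_\epsilon(x)\,dx\Big) \int_0^1 x\,\frac{dp_\epsilon}{d\epsilon}\,dx - \lambda \int_0^1 \frac{dp_\epsilon}{d\epsilon}\,dx,$$

and hence

$$\frac{dL(p_\epsilon(x))}{d\epsilon}\Big|_{\epsilon=0} = \Big(\sigma\Big(\int_0^1 x\,p(x)\,dx\Big) - z\Big)\,\sigma'\Big(\int_0^1 x\,p(x)\,dx\Big) \int_0^1 x\,\varphi(x)\,dx.$$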





Since for $0 < x < 1$ we have $\int_0^1 x\varphi(x)\,dx < \int_0^1 \varphi(x)\,dx = 0$ and
$\sigma'(u) = \sigma(u)(1 - \sigma(u)) \neq 0$, then the above variation vanishes if $p(x)$ satisfies

$$\int_0^1 x\,p(x)\,dx = \sigma^{-1}(z). \qquad (5.8.8)$$


In the case of a repeatable experiment, the constant output $z$ is replaced
by the target random variable $Z : \Omega \to [0,1]$, where $(\Omega, \mathcal{H}, P)$ is a probability
space. The problem asks for finding the random variable $X : \Omega \to [0,1]$ such
that the amount

$$\mathbb{E}\big[\sigma(\mathbb{E}[X]) - Z\big]^2 = \int_\Omega \big[\sigma(\mathbb{E}[X]) - Z(\omega)\big]^2 \, dP(\omega)$$

is minimized. It is known from the theory of random variables that a necessary
condition for this to be achieved is that $\mathbb{E}[Z] = \sigma\big(\mathbb{E}[X]\big)$. Consequently, $X$
is a random variable with the mean

$$\mathbb{E}[X] = \sigma^{-1}\big(\mathbb{E}[Z]\big),$$

which is equivalent to (5.8.8).


5.9 Summary 

This chapter presented several types of neurons. Some of them are classical, 
such as the perceptron, the sigmoid neuron, or the linear neuron; others 
are of more theoretical value, such as neurons with a continuum input. Each 
neuron is characterized by an input, a system of weights, a bias, an activation
function, and an output. 

The perceptron is a neuron model with a step activation function. The
neuron fires if the incoming signal exceeds a given threshold. Its output has a
jump discontinuity. A perceptron can be used as a binary classifier to classify
two linearly separable clusters in the plane. It can also be used to learn the
Boolean functions AND and OR, but it cannot learn the function XOR.

The sigmoid neuron has a sigmoid activation function. Its output is
continuous and can saturate if the signal is either too large or too small. The
sigmoid neuron can learn using logistic regression, which is equivalent to
obtaining the weights by the maximum likelihood method. Some
applications of sigmoid neurons are presented, such as binary classification
and prediction of default probabilities.

Linear neurons have random variable inputs and the activation function
is linear. Their optimal weights can be obtained by a closed-form solution.
However, in practice it is easier to train them by the gradient descent method.







If the input to a neuron is a continuous variable in a given interval and
the system of weights is replaced by a measure on that interval, then we obtain
a neuron with a continuum input. There are several types of neurons,
corresponding to different types of measures, such as Dirac, discrete, or
Lebesgue measures.

5.10 Exercises 

Exercise 5.10.1 Recall that $\neg x$ is the negation of the Boolean variable $x$.

(a) Show that a single perceptron can learn the Boolean function $y = x_1 \wedge \neg x_2$,
with $x_1, x_2 \in \{0,1\}$.

(b) The same question as in part (a) for the Boolean function $y = x_1 \vee \neg x_2$,
with $x_1, x_2 \in \{0,1\}$.

(c) Show that a perceptron with one Boolean input, $x$, can learn the negation
function $y = \neg x$. What about the linear neuron?

(d) Show that a perceptron with three Boolean inputs, $x_1, x_2, x_3$, can learn
$x_1 \wedge x_2 \wedge x_3$. What about $x_1 \vee x_2 \vee x_3$?

Exercise 5.10.2 Show that two finite linearly separable sets A and B can 
be separated by a perceptron with rational weights. 

Exercise 5.10.3 (a) Assume the inputs to a linear neuron are independent
and normally distributed, $X_i \sim \mathcal{N}(0, \sigma_i^2)$, $i = 1, \dots, n$. Find the optimal
weights, $w^*$.

(b) A one-dimensional random variable with zero mean, $Z$, is learned by a
linear neuron with input $X$. Assume the input, $X$, and the target, $Z$, are
independent. Write the cost function and find the optimal parameters, $w^*$.
Provide an interpretation for the result.

(c) Use Newton's method to obtain the optimal parameters of a linear neuron.

Exercise 5.10.4 Explain the equivalence between the linear regression algo- 
rithm and the learning of a linear neuron. 

Exercise 5.10.5 Consider a neuron with a continuum input, whose output
is $y = \sigma\big(\int_0^1 x \, d\mu(x)\big)$. Find the output in the case when the measure is
$\mu = \delta_{x_0}$.

Exercise 5.10.6 (Perceptron learning algorithm) Consider $n$ points,
$P_1, \dots, P_n$, included in a half-circle, and denote by $x_1, \dots, x_n$ their coordinate







Figure 5.14: For Exercise 5.10.6. 


vectors. A perceptron can learn the aforementioned half-circle by the
following algorithm:

1. Start from an arbitrary half-circle determined by its diameter and a unit
normal vector $w_0$. Then select an incorrectly classified point, $P_{i_0}$, i.e., a point
for which $\langle w_0, x_{i_0} \rangle < 0$, see Fig. 5.14.

2. Rotate the diameter such that the new normal is $w_1 = w_0 + x_{i_0}$. Show that
the point $P_{i_0}$ is now correctly classified.

3. Repeating the previous two steps, we construct inductively the sequence of
vectors $(w_m)_m$ such that $w_{m+1} = w_m + x_{i_m}$, where $P_{i_m}$ is a point misclassified
at step $m$. Show that the process ends in a finite number of steps, i.e., there
is $N \geq 1$ such that $\langle w_N, x_j \rangle > 0$, $\forall\, 1 \leq j \leq n$. Find an estimation of the
number $N$.
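The three steps above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the points are hypothetical unit vectors in the upper half-circle, a point with $\langle w, x \rangle \leq 0$ is treated as misclassified, and the cap `max_steps` is a safety guard that is not part of the exercise.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_half_circle(points, w0, max_steps=10_000):
    """Return (w, steps) such that <w, x> > 0 for every point x."""
    w = list(w0)
    for step in range(max_steps):
        misclassified = [x for x in points if dot(w, x) <= 0]
        if not misclassified:
            return w, step
        x = misclassified[0]
        # Update rule of the exercise: w_{m+1} = w_m + x_{i_m}
        w = [wi + xi for wi, xi in zip(w, x)]
    raise RuntimeError("no separating normal found within max_steps")

# Hypothetical data: unit vectors at angles strictly between 0 and 180 degrees.
angles = [10, 40, 75, 120, 170]
points = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

# Deliberately bad start: this unit normal misclassifies every point.
w, steps = perceptron_half_circle(points, w0=(0.0, -1.0))
```

The returned `steps` counts the updates performed, which can be compared with the estimate of $N$ asked for in part 3.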

Exercise 5.10.7 Modify the perceptron learning algorithm given by Exercise
5.10.6 for the case when the points $P_1, \dots, P_n$ are included in a half-plane.

Exercise 5.10.8 Let $1_A(x)$ denote the characteristic function of the set $A$,
namely, $1_A(x) = 1$ if $x \in A$ and $1_A(x) = 0$ if $x \notin A$.

(a) Show that the function

$$\varphi(x_1, x_2) = 1_{\{x_2 > x_1 + 0.5\}}(x_1, x_2) + 1_{\{x_2 < x_1 - 0.5\}}(x_1, x_2)$$

implements XOR.

(b) Show that the XOR function can be implemented by a linear combination
of two perceptrons.






Exercise 5.10.9 Show that if all input vectors $x_k$ have the same length, then
the $\alpha$-LMS algorithm minimizes the mean square error and in this case the
updating rule (5.6.6) becomes the gradient descent rule.

Exercise 5.10.10 Find the weights of a Madaline with two Adaline units 
which implements the XNOR function. 




Chapter 6 

Neural Networks 


We have discussed so far the case of a single neuron. In this chapter we shall
consider multiple layers of neurons whose outputs are fed into other layers
of neurons, forming neural networks. A layer of neurons is a processing step
in a neural network and can be of different types, depending on the weights
and activation function used in its neurons (fully-connected layer, convolution
layer, pooling layer, etc.). The main part of this chapter will deal with training
neural networks using the backpropagation algorithm.

Since the study of a neural network is heavily based on notation, we shall
start with a warm-up example.


6.1 An Example of Neural Network 

Consider two neurons having identical inputs, $x_1, x_2$, and (distinct) outputs
$y_1, y_2$, respectively. The outputs are fed into the third neuron, with the output
$y$, see Fig. 6.1. We assume the activation function for all neurons is the same
and is denoted by $\phi$. Each neuron has a bias that is considered as a weight
of a "fake" input equal to $-1$. Now, we assemble these neurons together into
an equivalent neural net, as given in Fig. 6.2. This forms a layer of two
neurons in the middle, called a hidden layer.

The synapses between the neurons are depicted by edges and each of them
is associated with a weight, denoted by $w_{ij}^{(\ell)}$. The indices have the following
significance: the upper index, $(\ell)$, represents the index of the layer the synapse
feeds into. The index $\ell = 1$ is used for the weights that feed the hidden layer
and the index $\ell = 2$ is used for the weights that enter the output layer, which
consists of the last neuron. The lower index $i$ indicates the input neuron. The


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_6 





Figure 6.1: Two neurons with identical inputs, $x_1, x_2$, and outputs $y_1, y_2$, are
fed into a third neuron with the output $y$.


index $i = 0$ is always used for the bias, while $i = 1, 2$ are the indices of the
inputs. The second lower index, $j$, corresponds to the output neuron, i.e., the
index of the neuron the synapse points to.

The inputs form the first layer of the network. They are denoted with
a zero upper index: $x_0^{(0)} = -1$, $x_1^{(0)}$ and $x_2^{(0)}$. Note that the biases of the
neurons in the hidden layer are denoted by $b_1^{(1)} = w_{01}^{(1)}$ and $b_2^{(1)} = w_{02}^{(1)}$, while
the bias for the neuron in the output layer is $b^{(2)} = w_{01}^{(2)}$.

The first neuron of the hidden layer collects the information from the
input layer into the signal

$$s_1^{(1)} = w_{01}^{(1)} x_0^{(0)} + w_{11}^{(1)} x_1^{(0)} + w_{21}^{(1)} x_2^{(0)} = \sum_{i=1}^{2} w_{i1}^{(1)} x_i^{(0)} - b_1^{(1)}.$$

Then it applies the activation function $\phi$ on the signal to obtain the output

$$y_1 = x_1^{(1)} = \phi\big(s_1^{(1)}\big) = \phi\Big(\sum_{i=1}^{2} w_{i1}^{(1)} x_i^{(0)} - b_1^{(1)}\Big).$$

Similarly, the second neuron of the hidden layer collects the information
into the signal

$$s_2^{(1)} = w_{02}^{(1)} x_0^{(0)} + w_{12}^{(1)} x_1^{(0)} + w_{22}^{(1)} x_2^{(0)} = \sum_{i=1}^{2} w_{i2}^{(1)} x_i^{(0)} - b_2^{(1)},$$

and then it applies the activation function $\phi$ to obtain the output

$$y_2 = x_2^{(1)} = \phi\big(s_2^{(1)}\big) = \phi\Big(\sum_{i=1}^{2} w_{i2}^{(1)} x_i^{(0)} - b_2^{(1)}\Big).$$


These relations can be manipulated more easily if they are written in matrix
form. The signals in the hidden layer are related to the inputs by

$$\begin{pmatrix} s_1^{(1)} \\ s_2^{(1)} \end{pmatrix} = \begin{pmatrix} w_{11}^{(1)} & w_{21}^{(1)} \\ w_{12}^{(1)} & w_{22}^{(1)} \end{pmatrix} \begin{pmatrix} x_1^{(0)} \\ x_2^{(0)} \end{pmatrix} - \begin{pmatrix} b_1^{(1)} \\ b_2^{(1)} \end{pmatrix}.$$

This can be written in the condensed form as

$$s^{(1)} = W^{(1)T} X^{(0)} - b^{(1)},$$

where $s^{(1)}$ represents the signal vector, $W^{(1)}$ is the weights matrix, $X^{(0)}$ is the
input vector, and $b^{(1)}$ denotes the bias vector for the neurons in the hidden
layer. The transpose notation follows from the way we wrote the indices.

The output of the hidden layer is the vector

$$X^{(1)} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} x_1^{(1)} \\ x_2^{(1)} \end{pmatrix} = \begin{pmatrix} \phi(s_1^{(1)}) \\ \phi(s_2^{(1)}) \end{pmatrix} = \phi\big(s^{(1)}\big),$$

where we adopt the convention that $\phi$ applied on a vector acts on its
components.

The last two formulas imply the useful relation $X^{(1)} = \phi\big(W^{(1)T} X^{(0)} - b^{(1)}\big)$,
providing the output vector of the hidden layer as a function of inputs. The
multiplication of $X^{(0)}$ by the matrix $W^{(1)T}$ is a linear transformation.
Subtracting $b^{(1)}$, this becomes an affine transformation. The activation function
$\phi$ adds nonlinearity to the output.

The neuron in the last layer collects the incoming information into the
signal

$$s^{(2)} = s_1^{(2)} = w_{01}^{(2)} x_0^{(1)} + w_{11}^{(2)} x_1^{(1)} + w_{21}^{(2)} x_2^{(1)} = \sum_{i=1}^{2} w_{i1}^{(2)} x_i^{(1)} - b^{(2)}.$$



Figure 6.2: Neural network with one hidden layer, having two inputs, $x_1, x_2$,
and two neurons in the hidden layer.


In the matrix form, this becomes

$$s^{(2)} = W^{(2)T} X^{(1)} - b^{(2)},$$

where the matrix coefficients are given by

$$W^{(2)T} = \begin{pmatrix} w_{11}^{(2)} & w_{21}^{(2)} \end{pmatrix}$$

and the input vector is

$$X^{(1)} = \begin{pmatrix} x_1^{(1)} \\ x_2^{(1)} \end{pmatrix}.$$

The signal coming out of the last neuron, $s^{(2)} = s_1^{(2)}$, is a one-dimensional
matrix. The lower index, 1, indicates that there is only one neuron in the
output layer. The network output is obtained by applying the activation function
on the signal $s^{(2)}$ as in the following:

$$Y = x_1^{(2)} = \phi\big(s^{(2)}\big) = \phi\big(W^{(2)T} X^{(1)} - b^{(2)}\big).$$

The output was denoted by $Y$ to be consistent with the notation used in
previous sections and also by $x_1^{(2)}$, where the upper script $(2)$ indicates that this
output is on the second layer (actually third, if we count the input layer) and
the lower index, 1, means that there is only one neuron in the output layer.

Using the formula for $X^{(1)}$ obtained previously, we can express the final
output in terms of the initial inputs as

$$Y = \phi\Big(W^{(2)T} \phi\big(W^{(1)T} X^{(0)} - b^{(1)}\big) - b^{(2)}\Big). \qquad (6.1.1)$$

We notice that the affine transformations, consisting of multiplication by
weight matrices and subtraction of bias vectors, alternate with nonlinear
transformations, which are induced by the activation function $\phi$. If we consider the
mapping $f_{w,b} : \mathbb{R}^2 \to \mathbb{R}$ defined by

$$f_{w,b}(X) = \phi\Big(W^{(2)T} \phi\big(W^{(1)T} X - b^{(1)}\big) - b^{(2)}\Big), \qquad (6.1.2)$$

then we call $f_{w,b}$ the input-output mapping of the given neural network.
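The input-output mapping (6.1.2) can be evaluated directly. The sketch below assumes the logistic function for $\phi$ and uses hypothetical weights and biases; the list layout `W1[i][j]` (zero-based, input `i` to hidden neuron `j`) is a choice made for this example only.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, W1, b1, W2, b2, phi=sigmoid):
    """Evaluate f_{w,b}(X) = phi(W2^T phi(W1^T X - b1) - b2).

    W1[i][j] holds w^{(1)}_{ij} (input i -> hidden neuron j);
    W2[j] holds the weight from hidden neuron j to the output neuron.
    """
    hidden = [
        phi(sum(W1[i][j] * x[i] for i in range(len(x))) - b1[j])
        for j in range(len(b1))
    ]
    return phi(sum(W2[j] * hidden[j] for j in range(len(hidden))) - b2)

# Hypothetical parameters, chosen only for illustration.
W1 = [[0.5, -1.0], [1.5, 2.0]]
b1 = [0.1, -0.3]
W2 = [1.0, -2.0]
b2 = 0.2
y = forward([1.0, 0.0], W1, b1, W2, b2)
```

With the logistic activation the output always lies in $(0, 1)$, and with all weights and biases set to zero it equals $\sigma(0) = 0.5$ for any input.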

Remark 6.1.1 Assume all the neurons from the previous network are linear
neurons, i.e., the activation function is $\phi(x) = x$. In this case, the output
becomes

$$Y = \big(W^{(1)} W^{(2)}\big)^T X^{(0)} - W^{(2)T} b^{(1)} - b^{(2)} = W^T X^{(0)} - b.$$

This suggests that the entire neural network is equivalent to only one
linear neuron having the weights matrix $W = W^{(1)} W^{(2)}$ and bias
$b = W^{(2)T} b^{(1)} + b^{(2)}$. Hence, the study of neural nets of linear neurons is
reduced to the study of only one linear neuron.
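The collapse described in Remark 6.1.1 is easy to check numerically. In this sketch the parameters and the input are hypothetical, and the two functions compute the two-layer linear network and the single equivalent linear neuron, respectively.

```python
def linear_net(x, W1, b1, W2, b2):
    # Two layers with phi(x) = x: Y = W2^T (W1^T X - b1) - b2
    s1 = [sum(W1[i][j] * x[i] for i in range(len(x))) - b1[j] for j in range(len(b1))]
    return sum(W2[j] * s1[j] for j in range(len(s1))) - b2

def collapsed_neuron(x, W1, b1, W2, b2):
    # Single linear neuron with W = W1 W2 and b = W2^T b1 + b2
    W = [sum(W1[i][j] * W2[j] for j in range(len(W2))) for i in range(len(x))]
    b = sum(W2[j] * b1[j] for j in range(len(b1))) + b2
    return sum(W[i] * x[i] for i in range(len(x))) - b

# Hypothetical parameters and input.
W1 = [[0.5, -1.0], [1.5, 2.0]]
b1 = [0.1, -0.3]
W2 = [1.0, -2.0]
b2 = 0.2
x = [0.7, -1.2]

y_net = linear_net(x, W1, b1, W2, b2)
y_one = collapsed_neuron(x, W1, b1, W2, b2)
```

The two values agree up to rounding error, for any input and any choice of parameters.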

Remark 6.1.2 Assume now that all the previous neurons are perceptrons,
with the activation function $\phi(x) = H(x)$. The output of the network in this
case is given by

$$Y = H\Big(w_{11}^{(2)} H\big(x_1 w_{11}^{(1)} + x_2 w_{21}^{(1)} - b_1^{(1)}\big) + w_{21}^{(2)} H\big(x_1 w_{12}^{(1)} + x_2 w_{22}^{(1)} - b_2^{(1)}\big) - b^{(2)}\Big).$$

It is worth noting that this network can learn exactly the XOR function
introduced in section 5.2. Choosing the weights such that the output becomes

$$Y = f(x, y) = H\big(H(x + y - 0.5) - H(x + y - 1.5) - 0.5\big),$$

with $x = x_1$ and $y = x_2$, we easily verify that $f(0,0) = 0$, $f(1,1) = 0$, $f(0,1) = 1$,
and $f(1,0) = 1$. The graph of the function $f(x,y)$ is a square-shaped canyon with
the bottom along the diagonal line $y = x$, having the bottom width equal to $\sqrt{2}$.
The need of three perceptrons to learn the XOR function is explained heuristically
in the following. It can be shown that each of the functions $y_1 = x_1 \wedge \neg x_2$
and $y_2 = \neg x_1 \wedge x_2$ can be learned by one perceptron. One more perceptron
is needed to learn the function $y_1 \vee y_2$, which is the XOR function,
$(x_1 \wedge \neg x_2) \vee (\neg x_1 \wedge x_2)$.
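The XOR network of Remark 6.1.2 can be checked on the four Boolean inputs. The sketch assumes the convention $H(u) = 1$ for $u \geq 0$ and $H(u) = 0$ otherwise; the value at $u = 0$ is never used on these inputs, so the choice does not matter here.

```python
def H(u):
    # Heaviside step function (the value at u = 0 is an assumed convention).
    return 1 if u >= 0 else 0

def xor_net(x, y):
    # Two hidden perceptrons followed by one output perceptron.
    return H(H(x + y - 0.5) - H(x + y - 1.5) - 0.5)

table = {(x, y): xor_net(x, y) for x in (0, 1) for y in (0, 1)}
```

The dictionary `table` then holds the truth table of the network on $\{0,1\}^2$.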




Figure 6.3: a. The model overfits given data. b. The model generalizes well.


6.1.1 Total variation and regularization


The generalization performance of a neural network depends on the output's
sensitivity with respect to its input. Assume we have two networks, which
have been trained on the same data: the one in Fig. 6.3 a, which overfits data,
and the one in Fig. 6.3 b, which generalizes well. The geometric difference
between these models consists in the fact that in the former case the total
variation of the output is larger than in the latter. The total variation is related
to the sensitivity of the output by a mathematical formula. If $Y = f(X^{(0)})$
denotes the input-output mapping of the network, the total variation of $f$
represents the total cumulation effect of local absolute variations

$$V(f) = \int \big|df(X^{(0)})\big|,$$

where the integral is taken on the input space. Therefore, $V(f)$ plays an
important role in evaluating the generalization performance of a network.
This is why a formula for the differential $dY = df(X^{(0)})$ is needed and shall be
computed in the following.

Formula (6.1.1) provides the net output in terms of its input. Assume
that all weights and biases are kept fixed. The changes in the output $Y$
when small changes in the input $X^{(0)}$ occur can be computed by an iterative
application of the chain rule, as in the following:

$$dY = \frac{dY}{ds^{(2)}}\, ds^{(2)} = \frac{dY}{ds^{(2)}} \frac{ds^{(2)}}{dX^{(1)}}\, dX^{(1)} = \frac{dY}{ds^{(2)}} \frac{ds^{(2)}}{dX^{(1)}} \frac{dX^{(1)}}{ds^{(1)}}\, ds^{(1)} = \frac{dY}{ds^{(2)}} \frac{ds^{(2)}}{dX^{(1)}} \frac{dX^{(1)}}{ds^{(1)}} \frac{ds^{(1)}}{dX^{(0)}}\, dX^{(0)}.$$

Using that $\frac{dY}{ds^{(2)}} = \phi'(s^{(2)})$ and $\frac{dX^{(1)}}{ds^{(1)}} = \phi'(s^{(1)})$, as well as
$\frac{ds^{(2)}}{dX^{(1)}} = W^{(2)T}$ and $\frac{ds^{(1)}}{dX^{(0)}} = W^{(1)T}$, we obtain

$$dY = \phi'(s^{(2)})\, W^{(2)T} \phi'(s^{(1)})\, W^{(1)T} dX^{(0)}. \qquad (6.1.3)$$


It is important to remark the relation with regularization. We have noticed
from the previous formula that the differential $dY$ depends on the weights
matrices $W^{(\ell)}$, $\ell \in \{1,2\}$. Keeping these weights small by requiring either an
$L^1$ or $L^2$ norm constraint, $\|W^{(\ell)}\|_1 \leq 1$ or $\|W^{(\ell)}\|_2 \leq 1$, will decrease $|dY|$.
This has a further diminishing effect on the total variation, $V(f)$, and hence
this will improve the generalization performance.

In the case when the activation function is the logistic function, $\phi = \sigma$,
using the properties of the logistic, formula (6.1.3) becomes

$$dY = \sigma(s^{(2)})\big(1 - \sigma(s^{(2)})\big)\, W^{(2)T} \sigma(s^{(1)})\big(1 - \sigma(s^{(1)})\big)\, W^{(1)T} dX^{(0)},$$

with the convention that $\sigma$ applied to a vector acts on its components. The
previous sensitivity formula will be used for the purpose of noise removal in
Exercise 6.6.5.
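A small numerical experiment can make formula (6.1.3) concrete. The sketch below assumes the logistic activation and reads the factor $\phi'(s^{(1)})$ as acting componentwise, i.e., as a diagonal matrix between $W^{(2)T}$ and $W^{(1)T}$ (an interpretation adopted for this example). The parameters are hypothetical, and the result is compared against central finite differences.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, W1, b1, W2, b2):
    s1 = [sum(W1[i][j] * x[i] for i in range(len(x))) - b1[j] for j in range(len(b1))]
    x1 = [sigmoid(s) for s in s1]
    s2 = sum(W2[j] * x1[j] for j in range(len(x1))) - b2
    return s1, s2, sigmoid(s2)

def input_sensitivity(x, W1, b1, W2, b2):
    """Coefficients dY/dx_i of dY = sum_i (dY/dx_i) dx_i, from (6.1.3)."""
    s1, s2, y = forward(x, W1, b1, W2, b2)
    dphi2 = y * (1.0 - y)                                   # sigma'(s^(2))
    dphi1 = [sigmoid(s) * (1.0 - sigmoid(s)) for s in s1]   # sigma'(s^(1)), componentwise
    return [
        dphi2 * sum(W2[j] * dphi1[j] * W1[i][j] for j in range(len(s1)))
        for i in range(len(x))
    ]

def finite_difference(x, W1, b1, W2, b2, h=1e-6):
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        yp = forward(xp, W1, b1, W2, b2)[2]
        ym = forward(xm, W1, b1, W2, b2)[2]
        grads.append((yp - ym) / (2.0 * h))
    return grads

# Hypothetical parameters and input.
W1 = [[0.5, -1.0], [1.5, 2.0]]
b1 = [0.1, -0.3]
W2 = [1.0, -2.0]
b2 = 0.2
x = [0.4, 0.9]
analytic = input_sensitivity(x, W1, b1, W2, b2)
numeric = finite_difference(x, W1, b1, W2, b2)
```

Scaling all weights down shrinks every entry of `analytic`, which is the mechanism behind the regularization remark above.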


6.1.2 Backpropagation


Consider a cost function $C(w,b)$, which measures the proximity between the
network output, $Y = f_{w,b}(X)$, and the target $Z$. The cost is a smooth function
of the weights, $w$, and biases, $b$. Its gradient is given by

$$\nabla C = (\nabla_w C, \nabla_b C).$$

For the aforementioned example, this vector is nine-dimensional, depending on
six weights and three biases. The gradient is needed for the gradient descent
method. We shall start the computation backwards, from the last weights and
biases toward the first ones, by a procedure called backpropagation.

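We compute the partial derivatives of the cost function with respect to
the weights having the upper script $\ell = 2$. We shall also use that the cost
$C$ depends on $Y$, which depends on $w_{i1}^{(2)}$ through $s_1^{(2)}$ only, as $Y = \phi(s_1^{(2)})$.
Hence, an application of the chain rule yields

$$\frac{\partial C}{\partial b^{(2)}} = \frac{\partial C}{\partial w_{01}^{(2)}} = \frac{\partial C}{\partial s_1^{(2)}} \frac{\partial s_1^{(2)}}{\partial w_{01}^{(2)}} = \delta_1^{(2)} x_0^{(1)} = -\delta_1^{(2)}$$

$$\frac{\partial C}{\partial w_{11}^{(2)}} = \frac{\partial C}{\partial s_1^{(2)}} \frac{\partial s_1^{(2)}}{\partial w_{11}^{(2)}} = \delta_1^{(2)} x_1^{(1)}$$

$$\frac{\partial C}{\partial w_{21}^{(2)}} = \frac{\partial C}{\partial s_1^{(2)}} \frac{\partial s_1^{(2)}}{\partial w_{21}^{(2)}} = \delta_1^{(2)} x_2^{(1)},$$

where we used the fact that $s_1^{(2)} = \sum_{i=0}^{2} w_{i1}^{(2)} x_i^{(1)}$ with $x_0^{(1)} = -1$, and employed
the notation $\delta_1^{(2)} = \frac{\partial C}{\partial s_1^{(2)}}$.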


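Next, we compute the partial derivatives with respect to the weights and
biases having the upper script $\ell = 1$. The cost $C$ depends on $Y$, which
depends on $w_{ij}^{(1)}$ through $s_j^{(1)}$, $j = 1, 2$, because $Y = \phi\big(W^{(2)T} \phi(s^{(1)}) - b^{(2)}\big)$
by (6.1.1). If $j = 1$, then chain rule yields

$$\frac{\partial C}{\partial b_1^{(1)}} = \frac{\partial C}{\partial w_{01}^{(1)}} = \frac{\partial C}{\partial s_1^{(1)}} \frac{\partial s_1^{(1)}}{\partial w_{01}^{(1)}} = \delta_1^{(1)} x_0^{(0)} = -\delta_1^{(1)}$$

$$\frac{\partial C}{\partial w_{11}^{(1)}} = \frac{\partial C}{\partial s_1^{(1)}} \frac{\partial s_1^{(1)}}{\partial w_{11}^{(1)}} = \delta_1^{(1)} x_1^{(0)}$$

$$\frac{\partial C}{\partial w_{21}^{(1)}} = \frac{\partial C}{\partial s_1^{(1)}} \frac{\partial s_1^{(1)}}{\partial w_{21}^{(1)}} = \delta_1^{(1)} x_2^{(0)}.$$

For $j = 2$ the chain rule implies

$$\frac{\partial C}{\partial b_2^{(1)}} = \frac{\partial C}{\partial w_{02}^{(1)}} = \frac{\partial C}{\partial s_2^{(1)}} \frac{\partial s_2^{(1)}}{\partial w_{02}^{(1)}} = \delta_2^{(1)} x_0^{(0)} = -\delta_2^{(1)}$$

$$\frac{\partial C}{\partial w_{12}^{(1)}} = \frac{\partial C}{\partial s_2^{(1)}} \frac{\partial s_2^{(1)}}{\partial w_{12}^{(1)}} = \delta_2^{(1)} x_1^{(0)}$$

$$\frac{\partial C}{\partial w_{22}^{(1)}} = \frac{\partial C}{\partial s_2^{(1)}} \frac{\partial s_2^{(1)}}{\partial w_{22}^{(1)}} = \delta_2^{(1)} x_2^{(0)},$$

where we used the fact that $s_j^{(1)} = \sum_{i=1}^{2} w_{ij}^{(1)} x_i^{(0)} - b_j^{(1)}$ and the notation
$\delta_j^{(1)} = \frac{\partial C}{\partial s_j^{(1)}}$, $j = 1, 2$.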


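To conclude the previous computations, we write

$$\frac{\partial C}{\partial b_j^{(\ell)}} = -\delta_j^{(\ell)} \qquad (6.1.4)$$

$$\frac{\partial C}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)} x_i^{(\ell-1)}. \qquad (6.1.5)$$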


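It follows that the gradient of the cost function, $\nabla C$, is known as long as
we figure out the values of $\delta_j^{(k)}$. We shall show that the deltas of the hidden
layer, $\delta_1^{(1)}$ and $\delta_2^{(1)}$, depend on the delta of the final layer, $\delta_1^{(2)}$, as in the
following. Using that $s_1^{(1)}$ affects the cost $C$ through $s_1^{(2)}$ only, see Fig. 6.4,
the chain rule provides

$$\delta_1^{(1)} = \frac{\partial C}{\partial s_1^{(1)}} = \frac{\partial C}{\partial s_1^{(2)}} \frac{\partial s_1^{(2)}}{\partial s_1^{(1)}} = \delta_1^{(2)} \frac{\partial s_1^{(2)}}{\partial s_1^{(1)}}. \qquad (6.1.6)$$

Figure 6.4: The dependence tree used for applying chain rule.

Since the signal $s_1^{(2)}$ can be written as

$$s_1^{(2)} = -w_{01}^{(2)} + w_{11}^{(2)} \phi(s_1^{(1)}) + w_{21}^{(2)} \phi(s_2^{(1)}),$$

it follows that

$$\frac{\partial s_1^{(2)}}{\partial s_1^{(1)}} = w_{11}^{(2)} \phi'(s_1^{(1)}).$$

Substituting in formula (6.1.6), we obtain

$$\delta_1^{(1)} = \delta_1^{(2)} w_{11}^{(2)} \phi'(s_1^{(1)}). \qquad (6.1.7)$$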


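Similarly, using the dependence tree given by Fig. 6.4, we have

$$\delta_2^{(1)} = \frac{\partial C}{\partial s_2^{(1)}} = \frac{\partial C}{\partial s_1^{(2)}} \frac{\partial s_1^{(2)}}{\partial s_2^{(1)}} = \delta_1^{(2)} \frac{\partial}{\partial s_2^{(1)}} \Big(-w_{01}^{(2)} + w_{11}^{(2)} \phi(s_1^{(1)}) + w_{21}^{(2)} \phi(s_2^{(1)})\Big) \qquad (6.1.8)$$

$$= \delta_1^{(2)} w_{21}^{(2)} \phi'(s_2^{(1)}). \qquad (6.1.9)$$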














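The last two formulas can be written in the vector form as the following
backpropagation formula for deltas:

$$\begin{pmatrix} \delta_1^{(1)} \\ \delta_2^{(1)} \end{pmatrix} = \delta_1^{(2)} \begin{pmatrix} w_{11}^{(2)} \phi'(s_1^{(1)}) \\ w_{21}^{(2)} \phi'(s_2^{(1)}) \end{pmatrix}. \qquad (6.1.10)$$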


Hence, in order to find the deltas of the hidden layer, it suffices to compute
the delta of the outer layer, $\delta_1^{(2)}$. We shall compute this in the following.

The cost function, $C(w,b) = \mathrm{dist}(Y, Z)$, depends on $s_1^{(2)}$ through the
output, $Y = f_{w,b}(X^{(0)})$, so an application of the chain rule provides

$$\delta_1^{(2)} = \frac{\partial C}{\partial s_1^{(2)}} = \frac{\partial C}{\partial Y} \frac{\partial Y}{\partial s_1^{(2)}} = \frac{\partial C}{\partial Y}\, \phi'(s_1^{(2)}), \qquad (6.1.11)$$

where we used that the output is $Y = \phi(s_1^{(2)})$. Substituting in (6.1.10) yields

$$\begin{pmatrix} \delta_1^{(1)} \\ \delta_2^{(1)} \end{pmatrix} = \frac{\partial C}{\partial Y}\, \phi'(s_1^{(2)}) \begin{pmatrix} w_{11}^{(2)} \phi'(s_1^{(1)}) \\ w_{21}^{(2)} \phi'(s_2^{(1)}) \end{pmatrix}. \qquad (6.1.12)$$

The factor $\frac{\partial C}{\partial Y}$ depends on the cost function considered. For instance, if
$C = \frac{1}{2}(Y - Z)^2$, then $\frac{\partial C}{\partial Y} = Y - Z$, where $Z$ is the target function that
needs to be learned by the network. Up to this point, all deltas have been
computed, their formulas being given by (6.1.11) and (6.1.12).
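The delta formulas (6.1.11)-(6.1.12) and the gradient components (6.1.4)-(6.1.5) can be put together for the warm-up network as in the sketch below. It assumes the logistic activation and the quadratic cost $C = \frac{1}{2}(Y - Z)^2$; the parameters, the input `x` and the target `z` are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, W1, b1, W2, b2):
    s1 = [sum(W1[i][j] * x[i] for i in range(2)) - b1[j] for j in range(2)]
    x1 = [sigmoid(s) for s in s1]
    s2 = sum(W2[j] * x1[j] for j in range(2)) - b2
    return s1, x1, s2, sigmoid(s2)

def cost(x, z, W1, b1, W2, b2):
    return 0.5 * (forward(x, W1, b1, W2, b2)[3] - z) ** 2

def backprop(x, z, W1, b1, W2, b2):
    s1, x1, s2, y = forward(x, W1, b1, W2, b2)
    # (6.1.11): delta of the output layer, with dC/dY = Y - Z and sigma' = sigma(1 - sigma)
    delta2 = (y - z) * y * (1.0 - y)
    # (6.1.10): deltas of the hidden layer
    delta1 = [delta2 * W2[j] * x1[j] * (1.0 - x1[j]) for j in range(2)]
    # (6.1.5): dC/dw_ij = delta_j * x_i ; (6.1.4): dC/db_j = -delta_j
    grad_W2 = [delta2 * x1[j] for j in range(2)]
    grad_b2 = -delta2
    grad_W1 = [[delta1[j] * x[i] for j in range(2)] for i in range(2)]
    grad_b1 = [-d for d in delta1]
    return grad_W1, grad_b1, grad_W2, grad_b2

# Hypothetical parameters, input and target.
W1 = [[0.5, -1.0], [1.5, 2.0]]
b1 = [0.1, -0.3]
W2 = [1.0, -2.0]
b2 = 0.2
x, z = [1.0, 0.5], 1.0
grad_W1, grad_b1, grad_W2, grad_b2 = backprop(x, z, W1, b1, W2, b2)
```

The returned quantities are the nine components of $\nabla C$: six weight derivatives and three bias derivatives. A finite-difference comparison on any single parameter is a convenient sanity check.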


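In order to minimize the cost function, $C$, we shall apply the gradient
descent method. The initialization is given by some arbitrary values
of weights and biases, $\big(w_{ij}^{(\ell)}(0), b_j^{(\ell)}(0)\big)$, and the approximation sequence
$\big(w_{ij}^{(\ell)}(k), b_j^{(\ell)}(k)\big)$, $k \geq 0$, is defined recursively by

$$b_j^{(\ell)}(k+1) = b_j^{(\ell)}(k) - \eta\, \frac{\partial C}{\partial b_j^{(\ell)}}\big(w_{ij}^{(\ell)}(k), b_j^{(\ell)}(k)\big)$$

$$w_{ij}^{(\ell)}(k+1) = w_{ij}^{(\ell)}(k) - \eta\, \frac{\partial C}{\partial w_{ij}^{(\ell)}}\big(w_{ij}^{(\ell)}(k), b_j^{(\ell)}(k)\big),$$

where the argument $k$ represents the number of iterations. By virtue of
formulas (6.1.4)-(6.1.5) the previous recursive system becomes

$$b_j^{(\ell)}(k+1) = b_j^{(\ell)}(k) + \eta\, \delta_j^{(\ell)}(k)$$

$$w_{ij}^{(\ell)}(k+1) = w_{ij}^{(\ell)}(k) - \eta\, \delta_j^{(\ell)}(k)\, x_i^{(\ell-1)}(k),$$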






















Figure 6.5: a. One-hidden layer neural network; b. Two-hidden layer neural 
network. 

where $\eta > 0$ is the learning rate and $x_i^{(\ell-1)}(k)$ is the $i$th input to the layer $\ell$,
given the values of the weights and biases at the $k$th run. The argument $k$
used in the previous formulas represents the current number of iterations.

When all training examples have been used to update the network
parameters we say one epoch has been processed. The process of parameter updates
continues until the learning system fits the training data. The gradient descent
is used until the algorithm has converged to a locally optimal solution, a fact
that can be verified by checking whether the norm of the gradient has
converged. If the model represents the training data perfectly, then the training
loss can be made arbitrarily small. If the model underfits the training data,
the training loss cannot decrease under a certain limit.
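The update rules and the notion of an epoch can be illustrated on the warm-up network. This is a minimal sketch under several assumptions that are not in the text: the logistic activation, the quadratic cost, one update per training example, a hypothetical training set (the Boolean OR function), fixed initial parameters, and arbitrary values for `eta` and `epochs`.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, W1, b1, W2, b2):
    x1 = [sigmoid(sum(W1[i][j] * x[i] for i in range(2)) - b1[j]) for j in range(2)]
    y = sigmoid(sum(W2[j] * x1[j] for j in range(2)) - b2)
    return x1, y

def total_cost(data, W1, b1, W2, b2):
    return sum(0.5 * (forward(x, W1, b1, W2, b2)[1] - z) ** 2 for x, z in data)

def train(data, W1, b1, W2, b2, eta=0.5, epochs=2000):
    for _ in range(epochs):                 # one epoch = one sweep over all examples
        for x, z in data:
            x1, y = forward(x, W1, b1, W2, b2)
            delta2 = (y - z) * y * (1.0 - y)
            delta1 = [delta2 * W2[j] * x1[j] * (1.0 - x1[j]) for j in range(2)]
            for j in range(2):
                W2[j] -= eta * delta2 * x1[j]           # w <- w - eta * delta * x
                b1[j] += eta * delta1[j]                # b <- b + eta * delta
                for i in range(2):
                    W1[i][j] -= eta * delta1[j] * x[i]
            b2 += eta * delta2
    return W1, b1, W2, b2

# Hypothetical training set (Boolean OR) and initial parameters.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
W1 = [[0.5, -0.4], [0.3, 0.8]]
b1 = [0.1, -0.2]
W2 = [0.7, -0.6]
b2 = 0.05
cost_before = total_cost(data, W1, b1, W2, b2)
W1, b1, W2, b2 = train(data, W1, b1, W2, b2)
cost_after = total_cost(data, W1, b1, W2, b2)
```

In practice the loop would stop once the gradient norm, or the decrease of the cost, falls below a tolerance rather than after a fixed number of epochs.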

Remark 6.1.3 The idea of backpropagation dates back to the 1960s, when it
appeared in the context of control theory [18], [62], [33]. Then it started
to be applied to neural networks in the mid-1970s, [77], [78], [123]. The first
computer implementations with applications to neural networks with hidden
layers were developed in the mid-1980s in the classical paper [108]. After
a break in popularity during the 2000s, the method returned in the literature
and applications due to the computing power of the new GPU systems.












6.2 General Neural Networks 


We shall discuss next the case of a feedforward neural network, i.e., a network 
where the information flows from the input to the output. Fig. 6.5 a depicts 
the case of a neural network with 3 layers: the input layer that has 4 inputs, 
the middle or hidden layer, which has 4 neurons, and the output layer with 
only 1 neuron. Fig. 6.5 b represents the case of a neural network with 4 layers, 
which looks similar to the previous network, but having 2 hidden layers, each 
with 4 neurons. 

We start by presenting the notation regarding layers, weights, inputs, and 
outputs of a general neural network. 

Layers We shall denote in the following the layer number by the upper script
$\ell$. We have $\ell = 0$ for the input layer, $\ell = 1$ for the first hidden layer, and
$\ell = L$ for the output layer. Note that the number of hidden layers is equal to
$L - 1$. The number of neurons in the layer $\ell$ is denoted by $d^{(\ell)}$. In particular,
$d^{(0)}$ is the number of inputs and $d^{(L)}$ is the number of outputs.

Weights The system of weights is denoted by $w_{ij}^{(\ell)}$, with $1 \leq \ell \leq L$,
$0 \leq i \leq d^{(\ell-1)}$, $1 \leq j \leq d^{(\ell)}$. The weight $w_{ij}^{(\ell)}$ is associated with the edge that joins
the $i$th neuron in the layer $\ell - 1$ to the $j$th neuron in the layer $\ell$. Note that
the edges join neurons in consecutive layers only. The weights $w_{0j}^{(\ell)} = b_j^{(\ell)}$ are
regarded as biases. They are the weights corresponding to the fake input
$x_0^{(\ell-1)} = -1$. The number of biases is equal to the number of neurons. The
number of weights, without biases, between the layers $\ell - 1$ and $\ell$ is given
by $d^{(\ell-1)} d^{(\ell)}$, so the total number of weights in a feedforward network is
$\sum_{\ell=1}^{L} d^{(\ell-1)} d^{(\ell)}$. For instance, the network in Fig. 6.5 a has 4 x 4 + 4 x 1 = 20
weights, while the one in Fig. 6.5 b has 4 x 4 + 4 x 4 + 4 x 1 = 36 weights.
The weights can be positive, negative, or zero. A positive weight encourages
the neuron to fire, while a negative weight inhibits it from doing so.
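The counting formula can be checked on the two networks of Fig. 6.5. In this short sketch a network is described only by its list of layer sizes $(d^{(0)}, \dots, d^{(L)})$; the function names are introduced here for illustration.

```python
def count_weights(layer_sizes):
    # sum over l = 1..L of d^(l-1) * d^(l), biases excluded
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def count_biases(layer_sizes):
    # one bias per neuron; the inputs in layer 0 are not neurons
    return sum(layer_sizes[1:])

one_hidden = [4, 4, 1]       # Fig. 6.5 a
two_hidden = [4, 4, 4, 1]    # Fig. 6.5 b
weights_a = count_weights(one_hidden)
weights_b = count_weights(two_hidden)
```

Here `weights_a` and `weights_b` reproduce the values 20 and 36 computed in the text.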

Inputs and outputs The inputs to the network are denoted by $x_j^{(0)}$, with
$1 \leq j \leq d^{(0)}$. We denote by $x_j^{(\ell)}$ the output of the $j$th neuron in the layer $\ell$.
Consequently, $x_j^{(L)}$ is the network output, with $1 \leq j \leq d^{(L)}$. The notation
$x_0^{(\ell)} = -1$ is reserved for the fake input linked to the bias corresponding to
the neurons in the layer $\ell + 1$.

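Consider the $j$th neuron in the layer $\ell$. In its first half, the neuron collects
the information from the previous layer, as a weighted average, into the signal

$$s_j^{(\ell)} = \sum_{i=0}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)} = \sum_{i=1}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)}. \qquad (6.2.13)$$

In its second half, the neuron applies an activation function $\phi$ to the previous
signal and outputs the value

$$x_j^{(\ell)} = \phi\big(s_j^{(\ell)}\big) = \phi\Big(\sum_{i=1}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)}\Big).$$

This can be written in the following equivalent matrix form:

$$X^{(\ell)} = \phi\big(W^{(\ell)T} X^{(\ell-1)} - B^{(\ell)}\big), \qquad (6.2.14)$$

where we used the notations

$$X^{(\ell)} = \big(x_1^{(\ell)}, \dots, x_{d^{(\ell)}}^{(\ell)}\big)^T, \quad W^{(\ell)} = \big(w_{ij}^{(\ell)}\big)_{i,j}, \quad B^{(\ell)} = \big(b_1^{(\ell)}, \dots, b_{d^{(\ell)}}^{(\ell)}\big)^T,$$

with $1 \leq i \leq d^{(\ell-1)}$ and $1 \leq j \leq d^{(\ell)}$, and where we used the convention that
the activation function applied on a vector acts on each of the components
of the vector. We note that the input to the network is given by the vector
$X^{(0)} = \big(x_1^{(0)}, \dots, x_{d^{(0)}}^{(0)}\big)^T$, while the network output is $X^{(L)} = \big(x_1^{(L)}, \dots, x_{d^{(L)}}^{(L)}\big)^T$.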


6.2.1 Forward pass through the network 

Given a system of weights and biases, one can use equation (6.2.14) to find
the outputs of each neuron in terms of their input. In particular, one finds
the output of the network, which is the output of the last layer, $x^{(L)} =
\big(x_1^{(L)}, \dots, x_{d^{(L)}}^{(L)}\big)$, where $d^{(L)}$ represents the number of network outputs (i.e.,
the number of neurons in the output layer). Going forward through the
network means to find all the neuron outputs. This forward pass represents a
target prediction step.

This prepares the ground for the second stage of the method, of going 
backwards through the network, to compute the gradient of the cost function 
in order to update parameters set by the gradient descent method. Passing 
forward and backward produces a sequence of parameters corresponding to 
an improved sequence of predictions of the target. This process ends when 
no cost improvements can be made anymore. 
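The forward pass amounts to applying relation (6.2.14) layer by layer. The sketch below assumes the logistic activation; the nested-list layout `weights[k][i][j]` for $w_{ij}^{(k+1)}$ (zero-based) and the 3-2-1 network with its parameters are hypothetical choices for this example.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def layer_forward(x_prev, W, B, phi=sigmoid):
    # X^(l) = phi(W^(l)T X^(l-1) - B^(l)), with W[i][j] = w_ij^(l)
    return [
        phi(sum(W[i][j] * x_prev[i] for i in range(len(x_prev))) - B[j])
        for j in range(len(B))
    ]

def forward_pass(x0, weights, biases, phi=sigmoid):
    """Return the list [X^(0), X^(1), ..., X^(L)] of all layer outputs."""
    outputs = [list(x0)]
    for W, B in zip(weights, biases):
        outputs.append(layer_forward(outputs[-1], W, B, phi))
    return outputs

# Hypothetical 3-2-1 network.
weights = [
    [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]],   # W^(1): 3 inputs -> 2 neurons
    [[1.0], [-1.5]],                          # W^(2): 2 neurons -> 1 output
]
biases = [[0.1, -0.2], [0.05]]
outputs = forward_pass([1.0, 0.5, -0.5], weights, biases)
prediction = outputs[-1]
```

All intermediate outputs are kept because the backward stage needs them to form the gradient.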


6.2.2 Going backwards through the network 

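Let $C(w,b)$ denote the cost function of the network, measuring a certain
proximity between the target $Z$ and the output $Y = x^{(L)}$. For the gradient
descent purposes, we need to compute the gradient $\nabla C = (\nabla_w C, \nabla_b C)$.

Since the weight $w_{ij}^{(\ell)}$ joins the $i$th neuron in the layer $\ell - 1$ to the $j$th
neuron in the layer $\ell$, it will enter in the composition of the signal $s_j^{(\ell)}$ produced
by the $j$th neuron of the layer $\ell$. Furthermore, no other signal in the $\ell$th layer
gets affected by this particular weight. Therefore, chain rule provides

$$\frac{\partial C}{\partial w_{ij}^{(\ell)}} = \frac{\partial C}{\partial s_j^{(\ell)}} \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}}, \qquad (6.2.15)$$

with no summation over $j$. We use the following delta notation

$$\delta_j^{(\ell)} = \frac{\partial C}{\partial s_j^{(\ell)}}$$

to denote the sensitivity of the error with respect to the signal $s_j^{(\ell)}$. The second
factor on the right side of (6.2.15) can be computed explicitly differentiating
in relation (6.2.13) as

$$\frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}} = x_i^{(\ell-1)}.$$

Substituting in (6.2.15) yields

$$\frac{\partial C}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)} x_i^{(\ell-1)}. \qquad (6.2.16)$$

A similar analysis is carried out for the derivative of the cost with respect to
the biases. Since the cost $C$ is affected by $b_j^{(\ell)}$ only through the signal $s_j^{(\ell)}$,
we have

$$\frac{\partial C}{\partial b_j^{(\ell)}} = \frac{\partial C}{\partial s_j^{(\ell)}} \frac{\partial s_j^{(\ell)}}{\partial b_j^{(\ell)}}. \qquad (6.2.17)$$

Differentiating in (6.2.13) yields

$$\frac{\partial s_j^{(\ell)}}{\partial b_j^{(\ell)}} = -1,$$

and hence (6.2.17) becomes

$$\frac{\partial C}{\partial b_j^{(\ell)}} = -\delta_j^{(\ell)}. \qquad (6.2.18)$$

From relations (6.2.16) and (6.2.18) it follows that in order to find the
gradient

$$\nabla C = (\nabla_w C, \nabla_b C) = \big(\delta_j^{(\ell)} x_i^{(\ell-1)}, -\delta_j^{(\ell)}\big), \qquad (6.2.19)$$

it suffices to know the deltas $\delta_j^{(\ell)}$. The construction of deltas is done by an
algorithm called backpropagation and will be presented in the next section.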















6.2.3 Backpropagation of deltas

We start with the computation of deltas in the last layer, $\delta_j^{(L)}$, with
$1 \leq j \leq d^{(L)}$. This essentially depends on the form of the cost function. We shall
assume, for instance, that

$$C(w,b) = \frac{1}{2} \sum_{j=1}^{d^{(L)}} \big(x_j^{(L)} - z_j\big)^2 = \frac{1}{2} \big\|x^{(L)} - z\big\|^2,$$

where $z = (z_1, \dots, z_{d^{(L)}})^T$ denotes the target variable. Using $x_j^{(L)} = \phi(s_j^{(L)})$,
then chain rule implies

$$\delta_j^{(L)} = \frac{\partial C}{\partial s_j^{(L)}} = \big(x_j^{(L)} - z_j\big)\, \phi'\big(s_j^{(L)}\big). \qquad (6.2.20)$$

In the case of some particular types of activation functions, this can be
computed even further. If the activation function is the logistic function, $\phi = \sigma$,
then

$$\delta_j^{(L)} = \frac{\partial C}{\partial s_j^{(L)}} = \big(x_j^{(L)} - z_j\big)\, \sigma\big(s_j^{(L)}\big)\big(1 - \sigma(s_j^{(L)})\big).$$

Or, if the activation function is the hyperbolic tangent, $\mathbf{t}$, then

$$\delta_j^{(L)} = \frac{\partial C}{\partial s_j^{(L)}} = \big(x_j^{(L)} - z_j\big)\big(1 - \mathbf{t}^2(s_j^{(L)})\big).$$


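The next task is to write the deltas of layer $\ell - 1$ in terms of the deltas
of the layer $\ell$. This will be achieved by a backpropagation formula. Chain
formula provides

$$\delta_i^{(\ell-1)} = \frac{\partial C}{\partial s_i^{(\ell-1)}} = \sum_{j=1}^{d^{(\ell)}} \frac{\partial C}{\partial s_j^{(\ell)}} \frac{\partial s_j^{(\ell)}}{\partial s_i^{(\ell-1)}} = \sum_{j=1}^{d^{(\ell)}} \delta_j^{(\ell)} \frac{\partial s_j^{(\ell)}}{\partial s_i^{(\ell-1)}}. \qquad (6.2.21)$$

The first equality follows from the definition of delta. The second one uses
the fact that the signal $s_i^{(\ell-1)}$ in the layer $\ell - 1$ affects all signals, $s_j^{(\ell)}$, in the
layer $\ell$, with $1 \leq j \leq d^{(\ell)}$. Thus, the cost, $C$, is affected by $s_i^{(\ell-1)}$ through
all signals in the layer $\ell$, a fact that explains the need of the summation over
index $j$. The last identity uses again the definition of delta.

We shall compute next the partial derivative $\frac{\partial s_j^{(\ell)}}{\partial s_i^{(\ell-1)}}$, i.e., the sensitivity
of a signal in the layer $\ell$ with respect to a signal in the previous layer. This
can be achieved differentiating in formula (6.2.13) as in the following:

$$\frac{\partial s_j^{(\ell)}}{\partial s_i^{(\ell-1)}} = \frac{\partial}{\partial s_i^{(\ell-1)}} \Big(\sum_{k=1}^{d^{(\ell-1)}} w_{kj}^{(\ell)} x_k^{(\ell-1)} - b_j^{(\ell)}\Big) = \frac{\partial}{\partial s_i^{(\ell-1)}} \Big(\sum_{k=1}^{d^{(\ell-1)}} w_{kj}^{(\ell)} \phi\big(s_k^{(\ell-1)}\big) - b_j^{(\ell)}\Big) = w_{ij}^{(\ell)} \phi'\big(s_i^{(\ell-1)}\big).$$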

Substituting in (6.2.21), we obtain the backpropagation formula for the deltas:

$$\delta_i^{(\ell-1)} = \phi'\big(s_i^{(\ell-1)}\big) \sum_{j=1}^{d^{(\ell)}} \delta_j^{(\ell)} w_{ij}^{(\ell)}. \qquad (6.2.22)$$

This produces the deltas for the $(\ell - 1)$th layer in terms of the deltas of the
$\ell$th layer using a weighted sum.

If the activation function, $\phi$, is the logistic function, $\sigma$, the previous
formula becomes

$$\delta_i^{(\ell-1)} = \sigma\big(s_i^{(\ell-1)}\big)\big(1 - \sigma(s_i^{(\ell-1)})\big) \sum_{j=1}^{d^{(\ell)}} \delta_j^{(\ell)} w_{ij}^{(\ell)}. \qquad (6.2.23)$$

Or, if the activation function, $\phi$, is the hyperbolic tangent, $\mathbf{t}$, the previous
formula takes the form

$$\delta_i^{(\ell-1)} = \big(1 - \mathbf{t}^2(s_i^{(\ell-1)})\big) \sum_{j=1}^{d^{(\ell)}} \delta_j^{(\ell)} w_{ij}^{(\ell)}.$$

To conclude the last few sections, in order to find the gradient of the cost
function given by (6.2.19), one uses the backpropagation formula of deltas
(6.2.22) to compute $\delta_j^{(\ell)}$ and the forward propagation formula (6.2.14) to
compute the outputs $x_i^{(\ell-1)}$.







6.2.4 Concluding relations 

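The computations done in the previous sections can be summarized by the
following set of master equations:

$$x_j^{(\ell)} = \phi\Big( \sum_{i=1}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)} \Big), \quad 1 \le j \le d^{(\ell)} \tag{6.2.24}$$

$$\delta_j^{(L)} = (x_j^{(L)} - z_j)\, \phi'(s_j^{(L)}), \quad 1 \le j \le d^{(L)} \tag{6.2.25}$$

$$\delta_i^{(\ell-1)} = \phi'(s_i^{(\ell-1)}) \sum_{j=1}^{d^{(\ell)}} \delta_j^{(\ell)} w_{ij}^{(\ell)}, \quad 1 \le i \le d^{(\ell-1)} \tag{6.2.26}$$

$$\frac{\partial C}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)} x_i^{(\ell-1)} \tag{6.2.27}$$

$$\frac{\partial C}{\partial b_j^{(\ell)}} = -\delta_j^{(\ell)}. \tag{6.2.28}$$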


The first one provides a forward recursive formula of the outputs in terms of
weights and biases. The second formula deals with the expression of the delta
in the output layer (under the assumption that the cost function is a sum
of squared errors). The third is the backpropagation formula for the deltas.
The last two formulas provide the gradient components of $C$ with respect to
weights and biases.

Equations (6.2.27)-(6.2.28) can be used to assess the sensitivity of the cost
function, $C(w, b)$, with respect to small changes in weights and biases. Using
the formula for the differential of a function yields

$$dC = \sum \frac{\partial C}{\partial w_{ij}^{(\ell)}}\, dw_{ij}^{(\ell)} + \sum \frac{\partial C}{\partial b_j^{(\ell)}}\, db_j^{(\ell)},$$

where the summation is taken over $i, j, \ell$, with $1 \le i \le d^{(\ell-1)}$,
$1 \le j \le d^{(\ell)}$ and $1 \le \ell \le L$.
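The master equations (6.2.24)-(6.2.28) can be sketched in a few lines of code. The following is only a minimal illustration, assuming sigmoid activations in every layer and the sum-of-squares cost; the function names (`forward`, `backward`, `cost`) and the nested-list storage `weights[l][i][j]` for $w_{ij}^{(\ell)}$ are choices made here, not notation from the text.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x0, weights, biases):
    """Forward pass (6.2.24): s_j = sum_i w_ij x_i - b_j, x_j = sigmoid(s_j).
    weights[l][i][j] and biases[l][j] belong to layer l + 1."""
    xs, ss = [x0], []
    for W, b in zip(weights, biases):
        x_prev = xs[-1]
        s = [sum(W[i][j] * x_prev[i] for i in range(len(x_prev))) - b[j]
             for j in range(len(b))]
        ss.append(s)
        xs.append([sigmoid(v) for v in s])
    return ss, xs

def backward(z, weights, ss, xs):
    """Deltas from (6.2.25)-(6.2.26), gradients from (6.2.27)-(6.2.28)."""
    dsig = lambda v: sigmoid(v) * (1.0 - sigmoid(v))
    L = len(weights)
    deltas = [None] * L
    # Output layer delta (6.2.25).
    deltas[L - 1] = [(xs[L][j] - z[j]) * dsig(ss[L - 1][j])
                     for j in range(len(z))]
    # Backpropagation of deltas (6.2.26).
    for l in range(L - 1, 0, -1):
        W = weights[l]
        deltas[l - 1] = [
            dsig(ss[l - 1][i])
            * sum(deltas[l][j] * W[i][j] for j in range(len(deltas[l])))
            for i in range(len(ss[l - 1]))
        ]
    # Gradient components (6.2.27) and (6.2.28).
    grad_W = [[[deltas[l][j] * xs[l][i] for j in range(len(deltas[l]))]
               for i in range(len(xs[l]))] for l in range(L)]
    grad_b = [[-d for d in deltas[l]] for l in range(L)]
    return grad_W, grad_b

def cost(x0, z, weights, biases):
    """Sum-of-squares cost C = 1/2 * sum_j (x_j^(L) - z_j)^2."""
    _, xs = forward(x0, weights, biases)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(xs[-1], z))
```

A quick sanity check of such a sketch is to compare one entry of `grad_W` with a central finite difference of `cost` with respect to the same weight; the two values should agree up to rounding.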


6.2.5 Matrix form 

In order to avoid the complexity introduced by multiple indices, the
aforementioned master equations can be written more compactly in matrix form.

First, we need to introduce a new product between vectors of the same
dimension. If $u$ and $v$ are vectors in $\mathbb{R}^n$, then the elementwise
product of $u$ and $v$ is a vector denoted by $u \odot v$ in $\mathbb{R}^n$ with
components $(u \odot v)_j = u_j v_j$.








Sometimes, this is referred to as the Hadamard product of two vectors. Its 
use simplifies considerably the form of the equations. 

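Using the notations

$$X^{(\ell)} = \big( x_1^{(\ell)}, \dots, x_{d^{(\ell)}}^{(\ell)} \big)^T, \quad W^{(\ell)} = \big( w_{ij}^{(\ell)} \big)_{i,j}, \quad B^{(\ell)} = \big( b_1^{(\ell)}, \dots, b_{d^{(\ell)}}^{(\ell)} \big)^T,$$

$$\delta^{(\ell)} = \big( \delta_1^{(\ell)}, \dots, \delta_{d^{(\ell)}}^{(\ell)} \big)^T, \quad s^{(\ell)} = \big( s_1^{(\ell)}, \dots, s_{d^{(\ell)}}^{(\ell)} \big)^T, \quad z = \big( z_1, \dots, z_{d^{(L)}} \big)^T,$$

and the convention that $\phi$ acts on each component of its vector arguments,
equations (6.2.24)-(6.2.28) take the following equivalent form:

$$X^{(\ell)} = \phi\big( W^{(\ell)T} X^{(\ell-1)} - B^{(\ell)} \big) \tag{6.2.29}$$

$$\delta^{(L)} = (X^{(L)} - z) \odot \phi'(s^{(L)}) \tag{6.2.30}$$

$$\delta^{(\ell-1)} = \big( W^{(\ell)} \delta^{(\ell)} \big) \odot \phi'(s^{(\ell-1)}) \tag{6.2.31}$$

$$\frac{\partial C}{\partial W^{(\ell)}} = X^{(\ell-1)} \delta^{(\ell)T} \tag{6.2.32}$$

$$\frac{\partial C}{\partial B^{(\ell)}} = -\delta^{(\ell)}. \tag{6.2.33}$$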


It is worth noting that the right side of (6.2.32) is a multiplication of two
matrices of vector type. More precisely, if $u = (u_1, \dots, u_n)^T$ and
$v = (v_1, \dots, v_n)^T$ are two column vectors in $\mathbb{R}^n$, then
$uv^T = (u_i v_j)_{i,j}$ is an $n \times n$ matrix. This should not be confused
with $u^T v = \sum_i u_i v_i$, which is a number. We also note that in (6.2.31)
the matrix $W^{(\ell)}$ does not have a transpose.
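The three vector products that appear in the matrix form are easy to mix up, so a small sketch may help. The helper names below (`hadamard`, `outer`, `inner`, `matvec`, `backprop_step`) are hypothetical and vectors are stored as plain Python lists; `backprop_step` is one reading of the matrix backpropagation relation, with the matrix applied without a transpose.

```python
def hadamard(u, v):
    """Elementwise (Hadamard) product of two vectors of equal length."""
    return [a * b for a, b in zip(u, v)]

def outer(u, v):
    """Outer product u v^T: a matrix with entries u_i * v_j."""
    return [[a * b for b in v] for a in u]

def inner(u, v):
    """Inner product u^T v: a single number."""
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    """Matrix-vector product W v, with W stored as a list of rows."""
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def backprop_step(W, delta, dphi_prev):
    """Previous-layer delta: (W delta) Hadamard phi'(s_prev)."""
    return hadamard(matvec(W, delta), dphi_prev)

print(hadamard([1, 2, 3], [4, 5, 6]))   # [4, 10, 18]
print(outer([1, 2], [3, 4]))            # [[3, 4], [6, 8]]
print(inner([1, 2, 3], [4, 5, 6]))      # 32
```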


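Remark 6.2.1 In the case of the linear neuron the previous formulae have a
much simpler form. Since the activation function is $\phi(x) = x$, the
backpropagation formula for the delta becomes
$\delta^{(\ell-1)} = W^{(\ell)} \delta^{(\ell)}$. Iterating yields

$$\delta^{(\ell)} = W^{(\ell+1)} W^{(\ell+2)} \cdots W^{(L)} \delta^{(L)}.$$

Using (6.2.32) and (6.2.33) this implies

$$\frac{\partial C}{\partial W^{(\ell)}} = X^{(\ell-1)} \big( W^{(\ell+1)} \cdots W^{(L)} \delta^{(L)} \big)^T, \qquad \frac{\partial C}{\partial B^{(\ell)}} = -W^{(\ell+1)} \cdots W^{(L)} \delta^{(L)},$$

formulas which provide a closed-form expression for the gradient of the cost
function, $\nabla C = \big( \frac{\partial C}{\partial W^{(\ell)}}, \frac{\partial C}{\partial B^{(\ell)}} \big)$.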










Remark 6.2.2 Formula (6.2.30) is based on the particular assumption that
the cost function is a sum of squares. If the cost function changes, only the
formula for $\delta^{(L)}$ should be adjusted accordingly. This is based on the
fact that deltas change backwards from $\ell = L$ to $\ell = 1$, see (6.2.31).
In the following we shall provide formulas for $\delta^{(L)}$ for a couple of
useful cost functions.


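Example 6.2.3 Assume the cost function is given by the cross-entropy

$$C = -\sum_k z_k \ln x_k^{(L)} = -\sum_k z_k \ln \phi(s_k^{(L)}),$$

which represents the inefficiency of using the predicted output, $x^{(L)}$,
rather than the true data $z$. The delta for the $j$th neuron in the output
layer is now given by

$$\delta_j^{(L)} = \frac{\partial C}{\partial s_j^{(L)}} = -\frac{\partial}{\partial s_j^{(L)}} \sum_k z_k \ln \phi(s_k^{(L)}) = -z_j \frac{\phi'(s_j^{(L)})}{\phi(s_j^{(L)})}.$$

This expression can be further simplified if we consider the activation function
to be the logistic function, $\phi(x) = \sigma(x)$. In this case, the computation
can be continued as

$$\delta_j^{(L)} = -z_j \frac{\sigma'(s_j^{(L)})}{\sigma(s_j^{(L)})} = -z_j \big( 1 - \sigma(s_j^{(L)}) \big).$$

Using the Hadamard product, this writes more compactly as

$$\delta^{(L)} = -z \odot \big( 1 - \sigma(s^{(L)}) \big),$$

and can be regarded as a replacement for the relation (6.2.30) in the case
when the cost function is given by the cross-entropy.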


Example 6.2.4 Another cost function, which can be used in the case when
the activation function (of the last layer) is the logistic function,
$\phi(x) = \sigma(x)$, is the regular cross-entropy function,$^1$ where
$z_k \in \{0, 1\}$:

$$C = -\sum_k z_k \ln x_k^{(L)} - \sum_k (1 - z_k) \ln\big( 1 - x_k^{(L)} \big). \tag{6.2.34}$$

1 This is the sum of binary entropy functions, see Exercise 12.13.15.










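In this case the deltas of the output layer are computed as

$$\begin{aligned}
\delta_j^{(L)} = \frac{\partial C}{\partial s_j^{(L)}}
&= -\frac{\partial}{\partial s_j^{(L)}} \Big( \sum_k z_k \ln \sigma(s_k^{(L)}) + \sum_k (1 - z_k) \ln\big( 1 - \sigma(s_k^{(L)}) \big) \Big) \\
&= -z_j \frac{\sigma'(s_j^{(L)})}{\sigma(s_j^{(L)})} + (1 - z_j) \frac{\sigma'(s_j^{(L)})}{1 - \sigma(s_j^{(L)})} \\
&= -z_j \big( 1 - \sigma(s_j^{(L)}) \big) + (1 - z_j)\, \sigma(s_j^{(L)}) \\
&= \sigma(s_j^{(L)}) - z_j = x_j^{(L)} - z_j,
\end{aligned}$$

where we have used the useful property $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
Hence, in this case the deltas are equal to the differences between the model
outcomes and target values, $\delta^{(L)} = X^{(L)} - z$.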
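The closed form obtained in this example can be checked numerically. The sketch below assumes a logistic output layer and the cross-entropy (6.2.34); it compares the closed-form deltas with a central finite difference of the cost with respect to each output signal. The function names are illustrative only.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def binary_cross_entropy(s, z):
    """Cost (6.2.34) written in terms of the output signals s_k."""
    return -sum(zk * math.log(sigmoid(sk))
                + (1.0 - zk) * math.log(1.0 - sigmoid(sk))
                for sk, zk in zip(s, z))

def delta_closed_form(s, z):
    """Closed form: delta_j = sigma(s_j) - z_j."""
    return [sigmoid(sk) - zk for sk, zk in zip(s, z)]

def delta_numeric(s, z, eps=1e-6):
    """Central finite difference of the cost with respect to each s_j."""
    out = []
    for j in range(len(s)):
        sp = list(s); sp[j] += eps
        sm = list(s); sm[j] -= eps
        out.append((binary_cross_entropy(sp, z)
                    - binary_cross_entropy(sm, z)) / (2 * eps))
    return out

s, z = [0.3, -1.2, 2.0], [1, 0, 1]
print(delta_closed_form(s, z))
print(delta_numeric(s, z))
```

The two printed lists should agree to several decimal places.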


6.2.6 Gradient descent algorithm 

Consider a neural network with the initial system of weights and biases
given by $w_{ij}^{(\ell)}(0)$ and $b_j^{(\ell)}(0)$. The weights can be
initialized efficiently by the procedure described in section 6.3. Let
$\eta > 0$ be the learning rate. The optimum values of weights and biases after
the network is trained with the gradient descent algorithm are approximated by
the sequences defined recursively by

$$w_{ij}^{(\ell)}(n+1) = w_{ij}^{(\ell)}(n) - \eta\, \delta_j^{(\ell)}(n)\, x_i^{(\ell-1)}(n)$$

$$b_j^{(\ell)}(n+1) = b_j^{(\ell)}(n) + \eta\, \delta_j^{(\ell)}(n),$$

where the outputs $x_i^{(\ell)}(n)$ and deltas $\delta_j^{(\ell)}(n)$ depend on
the weights, $w_{ij}^{(\ell)}(n)$, and biases, $b_j^{(\ell)}(n)$, obtained at
the $n$th step. We note that when writing the above equations we have used the
aforementioned formulas for $\frac{\partial C}{\partial w_{ij}^{(\ell)}}$ and
$\frac{\partial C}{\partial b_j^{(\ell)}}$.
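As an illustration of these recursions, here is a minimal sketch for the simplest possible network: a single sigmoid neuron with one input, signal $s = wx - b$ and quadratic cost. The training point, learning rate and number of steps are arbitrary choices made for the example, and `train_neuron` is a hypothetical name.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_neuron(x, z, eta=1.0, steps=500):
    """Gradient descent for one sigmoid neuron with signal s = w*x - b."""
    w, b = 0.0, 0.0
    costs = []
    for _ in range(steps):
        out = sigmoid(w * x - b)
        costs.append(0.5 * (out - z) ** 2)
        # Output delta for the quadratic cost: (out - z) * sigma'(s).
        delta = (out - z) * out * (1.0 - out)
        w = w - eta * delta * x   # dC/dw = delta * x
        b = b + eta * delta       # dC/db = -delta
    return w, b, costs

w, b, costs = train_neuron(x=1.5, z=0.8)
print(costs[0], costs[-1])
```

Note the plus sign in the bias update: it comes from the convention $s = wx - b$, which makes $\partial C / \partial b = -\delta$.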


6.2.7 Vanishing gradient problem 

Assume we have a neural network where all neurons have the activation
function given by the logistic function $\sigma$. If a signal is too large in
absolute value, then the logistic function saturates, approaching either 0 or 1,
a fact that, via formula (6.2.23), implies that $\delta_i^{(\ell-1)}$ is equal,
or close, to zero.

On the other side, since $\sigma' = \sigma(1 - \sigma)$, it is easy to show that
$0 < \sigma' \le 1/4$. This implies via (6.2.23) that backpropagation through a
sigmoid layer reduces the gradient by a factor of at least 4. After propagating
through several layers, see formulae (6.2.16) and (6.2.18), the resulting
gradient becomes very small, a fact that implies a slow, or a stopped learning.
This phenomenon is called the vanishing gradient problem.
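The size of the effect can be made concrete with a few numbers. The sketch below only tracks the factor contributed by $\sigma'$ in each layer and ignores the weights, which is a simplifying assumption; `activation_factor_bound` is a name introduced here for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_prime(s):
    y = sigmoid(s)
    return y * (1.0 - y)

def activation_factor_bound(depth):
    """Upper bound on the product of sigma' factors across `depth` layers."""
    return 0.25 ** depth

# The derivative is largest at s = 0 and collapses for saturated signals.
for s in (0.0, 2.0, 5.0, 10.0):
    print(s, sigmoid_prime(s))

# Best-case shrinkage of the sigma' factors after several layers.
for depth in (1, 5, 10):
    print(depth, activation_factor_bound(depth))
```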

From the geometrical point of view, this means that the cost surface
$(w, b) \to C(w, b)$ has flat regions, called plateaus. The plain-vanilla
gradient descent method does not perform well on plateaus. In order to get the
iteration out of a plateau, the method needs to be adjusted with a momentum
term or a stochastic perturbation, see section 4.13. Another way to minimize
the vanishing gradient problem is to use activation functions which do not
saturate. Improved results can be obtained sometimes by using ReLU, or
other hockey-type activation functions.

To conclude, the neuron learning process in the gradient descent method
is affected by two ingredients: the learning rate, $\eta$, and the size of the
cost function gradient, $\nabla C$. If the learning rate is too large, the
algorithm will not converge, the approximating sequence being oscillatory. On
the other side, if it is too small, then the algorithm will take a long time to
converge and might also get stuck in a point of local minimum.

The size of the gradient, $\nabla C$, depends on two things: the type of
activation function used, $\phi$, and the choice of the cost function, $C$. The
gradient $\nabla C$ depends on the deltas $\delta^{(\ell)}$, which depend on the
derivative of the activation function, $\phi'$. Given the aforementioned
saturation tendency of sigmoidal functions, it is recommended to use the ReLU
activation function for deep feedforward neural networks.

The choice of the cost function can also affect the learning efficiency. In
the next table we provide three cost functions and their deltas in the output
layer, $\delta_j^{(L)}$, which will be backpropagated to obtain the other deltas,
$\delta_j^{(\ell)}$. The formulas are written for a sigmoid neuron, i.e.,
$\phi = \sigma$.

Cost formula                                                          Deltas
$\frac{1}{2} \| x^{(L)} - z \|^2$                                     $\delta_j^{(L)} = (x_j^{(L)} - z_j)\, \sigma'(s_j^{(L)})$
$-\sum_k z_k \ln x_k^{(L)}$                                           $\delta_j^{(L)} = -z_j (1 - \sigma(s_j^{(L)}))$
$-\sum_k z_k \ln x_k^{(L)} - \sum_k (1 - z_k) \ln(1 - x_k^{(L)})$     $\delta_j^{(L)} = x_j^{(L)} - z_j$


The first line, which corresponds to the quadratic cost, has a delta in the
output layer that depends on $\sigma'$, which by backpropagation influences the
vanishing gradient tendency.

The second line corresponds to the cross-entropy function$^2$ and has a term
$(1 - \sigma)$ which may lead to vanishing gradient behavior, but not as much as
in the case of the quadratic error.

2 This has the semantic interpretation that targets that take on the value zero
are unobservable.

The last line shows that the symmetric entropy delta does not contain the
derivative of the activation function, and hence, this produces the least
vanishing gradient effect among all aforementioned cost functions.

There is one more factor that influences learning efficiency. This is the
weights initialization. We shall deal with this important problem in section
6.3.

6.2.8 Batch training 

Assume you are in an unknown city looking for the Science Museum. The
only information you can get are directions from the people you meet in the
street. It would be ideal to obtain directions from all people in town and then
average their opinions, but this procedure is time consuming and hence not
always feasible, if the town is large. Therefore, you are left with the options
of asking one, or several individuals about the correct direction.

If you ask one individual “Where is the Science Museum?”, you are given
a direction, which may or may not be fully trustworthy. However, you take the
chance to go a certain distance in the suggested direction. Then you ask again
and walk another distance in the new suggested direction. This way, you
are getting closer to the museum and eventually get there.

There is another way to navigate toward the museum. Ask a group of
100 tourists about the museum direction and average out their suggested
directions. This way, being more confident, you will walk for a larger distance
in the average direction. Then repeat the procedure by asking again. In the
latter case you expect to get to the museum sooner than in the former, since
the learning rate is larger and you move in a more trustworthy direction.

A similar situation occurs when the gradient descent method is used. Each
input $X^{(0)}$ produces a gradient direction $\nabla C(X^{(0)})$. Using the
full input data to compute the gradient of the cost function would provide the
best result, but this procedure has the disadvantage of being time expensive.
Therefore, a trade-off between accuracy and computation time is made by
sampling a random minibatch and computing an estimation of the gradient
from it. If we consider a minibatch of $N$ inputs,
$\{X^{(0,1)}, \dots, X^{(0,N)}\}$, the average direction is

$$\widetilde{\nabla C} = \frac{1}{N} \sum_{k=1}^{N} \nabla C(X^{(0,k)}).$$

By the Central Limit Theorem averages tend to have a smaller variance than each
individual outcome. More precisely, if the inputs are independent random
variables, then the average gradient $\widetilde{\nabla C}$ has the same mean as
each individual gradient $\nabla C(X^{(0,k)})$ and a variance $N$ times smaller.
For instance, if the batch size is $N = 100$, the standard deviation is 10 times
smaller, so the mean direction $\widetilde{\nabla C}$ is one digit more accurate
than the raw direction $\nabla C$. Since the error is smaller, we are more
confident in proceeding in this direction for finding the minimum, and hence we
may choose a larger learning rate, at least in the beginning.

Figure 6.6: a. Updating weights in the case of a small batch requires a large
number of steps. b. Updating weights in the case of a large batch needs fewer
steps.

The procedure of updating the network parameters using the gradient 
descent method with a gradient estimated as an average of gradients mea- 
sured on randomly selected training samples is called the stochastic gradient 
descent method. 
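The variance reduction obtained by averaging can be observed in a small simulation. The sketch below is a toy model, not a neural network: each "gradient" is a true value 1 plus Gaussian noise of standard deviation 2, both chosen arbitrarily, and the empirical variance of the minibatch average is compared for batch sizes 1 and 100.

```python
import random
import statistics

def noisy_gradient(rng):
    """Toy per-sample gradient: true value 1.0 plus Gaussian noise."""
    return 1.0 + rng.gauss(0.0, 2.0)

def minibatch_gradient(rng, n):
    """Average of n independent per-sample gradients."""
    return sum(noisy_gradient(rng) for _ in range(n)) / n

def empirical_variance(n, trials=2000, seed=0):
    """Variance of the minibatch estimate over repeated draws."""
    rng = random.Random(seed)
    return statistics.pvariance(
        [minibatch_gradient(rng, n) for _ in range(trials)])

v1 = empirical_variance(1)
v100 = empirical_variance(100)
print(v1, v100, v1 / v100)
```

The printed ratio should be close to the batch size, 100, in agreement with the variance being $N$ times smaller.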

It is worth noting the trade-off between the size of the batch and the
training time, see Fig. 6.6. The larger the size of the batch, the more accurate
the direction of the gradient, but the longer it takes to train. The fewer data
fed into a batch, the faster the training but the less representative the result
is. For instance, in the case of the MNIST data, the usual batch size used in
practice is about 30.


6.2.9 Definition of FNN

After working with feedforward neural networks at a heuristic level, we shall
provide in this section, for the sake of mathematical completeness, a formal
definition of this concept.

First, we shall review all the ingredients used so far in the construction.
We have seen that a feedforward neural network contains neurons arranged
in $L + 1$ distinct layers. The first layer corresponding to the input data is
obtained for $\ell = 0$ and the output layer is realized for $\ell = L$. The
$j$th neuron in layer $\ell$ has the activation $x_j^{(\ell)}$. For the input
layer, the activations come from the input data, while for the other layers the
activations are computed in terms of the previous layer activations by the
forward pass formula

$$x_j^{(\ell)} = \phi\Big( \sum_i w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)} \Big).$$

Here, $\phi$ is the activation function and $w_{ij}^{(\ell)}$ and
$b_j^{(\ell)}$ are the weights and biases of the neurons in the $\ell$th layer.
We also note that
$s_j^{(\ell)} = \sum_i w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)}$ is an
affine function of $x_i^{(\ell-1)}$ and the activation $x_j^{(\ell)}$ is obtained
by applying the function $\phi$ on $s_j^{(\ell)}$.

Now we pursue a higher level of abstraction. We shall denote by
$\mathcal{F}(U) = \{f : U \to \mathbb{R}\}$ the set of real-valued functions
defined on the set $U$. Consider the set of indices
$U_\ell = \{1, 2, \dots, d^{(\ell)}\}$. Implement the affine function

$$\alpha_\ell : \mathcal{F}(U_{\ell-1}) \to \mathcal{F}(U_\ell)$$

by $\alpha_\ell(x^{(\ell-1)}) = s^{(\ell)}$. Then
$x^{(\ell)} = (\phi \circ \alpha_\ell)(x^{(\ell-1)})$ is obtained by composing
the affine function on the previous layer activation with the nonlinearity
$\phi$. The formal definition of a feedforward neural network is as in the
following.


Definition 6.2.5 Let $U_\ell = \{1, 2, \dots, d^{(\ell)}\}$, $0 \le \ell \le L$,
and consider the sequence of affine functions $\alpha_1, \dots, \alpha_L$

$$\alpha_\ell : \mathcal{F}(U_{\ell-1}) \to \mathcal{F}(U_\ell)$$

and the sequence of activation functions
$\phi^{(\ell)} : \mathbb{R} \to \mathbb{R}$. Then the corresponding
feedforward neural network is the sequence of maps $f_0, f_1, \dots, f_L$, where

$$f_\ell = \phi^{(\ell)} \circ \alpha_\ell \circ f_{\ell-1}, \quad 1 \le \ell \le L,$$

with $f_0$ given.


Hence, a deep feedforward neural network produces a sequence of progressively
more abstract reparametrizations, $f_\ell$, by mapping the input, $f_0$,
through a series of parametrized functions, $\alpha_\ell$, and nonlinear
activation functions, $\phi^{(\ell)}$. The network's output is given by

$$f_L = \phi^{(L)} \circ \alpha_L \circ \phi^{(L-1)} \circ \alpha_{L-1} \circ \cdots \circ \phi^{(1)} \circ \alpha_1.$$

The number $L$ is referred to as the depth of the network and
$\max\{d^{(1)}, \dots, d^{(L)}\}$ as the width. In the case of regression
problems the last activation function is linear, $\phi^{(L)}(x) = x$. For
classification problems, the last activation function is a softmax,
$\phi^{(L)}(x) = \mathrm{softmax}(x)$, while in the case of logistic
regression the activation function is a logistic sigmoid,
$\phi^{(L)}(x) = \sigma(x)$.
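The definition translates almost literally into code if each layer is stored as a pair made of an affine map and an activation. The following is a minimal sketch under that assumption; `affine`, `componentwise` and `network` are hypothetical helper names, and the affine map uses the sign convention $s_j = \sum_i w_{ij} x_i - b_j$.

```python
import math

def affine(W, b):
    """Affine map: alpha(x)_j = sum_i W[i][j] * x_i - b_j."""
    def alpha(x):
        return [sum(W[i][j] * x[i] for i in range(len(x))) - b[j]
                for j in range(len(b))]
    return alpha

def componentwise(phi):
    """Lift a scalar activation to act on each component of a vector."""
    return lambda s: [phi(v) for v in s]

def network(layers):
    """Compose the layers in order; layers is a list of (alpha, phi) pairs."""
    def f(x):
        for alpha, phi in layers:
            x = componentwise(phi)(alpha(x))
        return x
    return f

# Depth L = 2: a tanh hidden layer of width 2 and a linear output layer.
net = network([
    (affine([[1.0, 2.0]], [0.0, 0.0]), math.tanh),
    (affine([[1.0], [1.0]], [0.5]), lambda v: v),
])
print(net([0.0]))   # [-0.5]
```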





6.3 Weights Initialization 


In the case of networks with a small number of layers, initializing all weights
and biases to zero, or sampling them from a uniform or a Gaussian distribution
of zero mean, usually provides satisfactory convergence results. However, in
the case of deep neural networks, a correct initialization of weights
makes a significant difference in the way the convergence of the optimization
algorithm works.

The magnitude of weights plays an important role in avoiding as much
as possible the vanishing and exploding gradient problems. This will be
described in the following two marginal situations.

(i) If the weights are too close to zero, then the variance of the input
signal decreases as it passes through each layer of the network.

This heuristic statement is based on the following computation. Let
$x_j^{(\ell)}$ denote the output of the $j$th neuron in the $\ell$th layer. The
outputs between two consecutive layers are given by the forward propagation
equation

$$x_j^{(\ell)} = \phi\Big( \sum_i w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)} \Big).$$


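Using the Appendix formula (D.4.1)

$$\mathrm{Var}(f(X)) \approx f'(\mathbb{E}[X])^2\, \mathrm{Var}(X)$$

the aforementioned equation provides the variance of neuron activations in the
$\ell$th layer in terms of variances of neuron activations in the previous layer
as

$$\mathrm{Var}(x_j^{(\ell)}) \approx \phi'\Big( \sum_i w_{ij}^{(\ell)} \mathbb{E}[x_i^{(\ell-1)}] - b_j^{(\ell)} \Big)^2 \sum_i (w_{ij}^{(\ell)})^2\, \mathrm{Var}(x_i^{(\ell-1)}). \tag{6.3.35}$$

By Cauchy's inequality, we have

$$\sum_i (w_{ij}^{(\ell)})^2\, \mathrm{Var}(x_i^{(\ell-1)}) \le \Big( \sum_i (w_{ij}^{(\ell)})^4 \Big)^{1/2} \Big( \sum_i \mathrm{Var}(x_i^{(\ell-1)})^2 \Big)^{1/2}.$$

Making the assumption that the activation function has a bounded derivative,
$(\phi')^2 < C$, then taking the sum over $j$ and using Cauchy's inequality
yields

$$\sum_j \mathrm{Var}(x_j^{(\ell)})^2 < C^2 \sum_{i,j} (w_{ij}^{(\ell)})^4 \sum_i \mathrm{Var}(x_i^{(\ell-1)})^2.$$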


This shows that for small weights the sum of squares of variances in the
$\ell$th layer is smaller than the sum of squares of variances in the
$(\ell - 1)$th layer. Iterating the previous inequality, it follows that the
signal's variance decreases to zero after passing through a certain number of
layers. Therefore, after passing through a few layers, the signal becomes too
low to be significant.

(ii) If the weights are too large, then either the variance of the signal
tends to get amplified as it passes through the network layers, or the network
approaches a vanishing gradient problem.

If the weights are large, the sum $\sum_i w_{ij}^{(\ell)} x_i^{(\ell-1)}$ also
gets large. If the activation function is linear, $\phi(x) = x$, then

$$\mathrm{Var}(x_j^{(\ell)}) = \sum_i (w_{ij}^{(\ell)})^2\, \mathrm{Var}(x_i^{(\ell-1)}),$$

which implies that $\mathrm{Var}(x_j^{(\ell)})$ is large.

If the activation function is of sigmoid type, then when the weights
are large the sum $\sum_i w_{ij}^{(\ell)} x_i^{(\ell-1)}$ tends to have large
values, and hence the activation function $\phi$ becomes saturated, a fact that
leads to an approaching zero gradient problem.


Hence, neither choosing the weights too close to zero, nor choosing them
far too large is a feasible initialization, since in both cases the
initialization is outside the right basin of attraction of the optimization
procedure. We need to initialize the weights with values in a reasonable range
before we start training the network. We shall deal next with the problem of
finding this reasonable range.


The main idea is based on the fact that the propagation of the signal
error through a deep neural network can be quantified by the signal variance,
namely the variance of the neuron activations. In order to keep the variance
under control, away from exploding or vanishing values, the simplest way is
to find the weight values for which the variance remains roughly unchanged
as the signal passes through each layer. Since assigning the initial weights is
a random process, we assume the weights to be either uniform- or
Gaussian-distributed random variables with zero mean.


To make notations simpler, after we fix a layer $\ell$, we denote the incoming
and outgoing vector signals from that layer by $X = X^{(\ell-1)}$ and
$Y = X^{(\ell)}$, respectively. The number of neurons feeding into the layer
$\ell$ will be denoted in this section by $N = d^{(\ell-1)}$. The $j$th
component of $Y$ is given by

$$Y_j = \phi\Big( \sum_i W_{ij} X_i + b_j \Big),$$

where $W_{ij}$ are random variables with $\mathbb{E}[W_{ij}] = 0$, and $X_i$ are
independent, identically distributed random variables, with zero mean (for
instance, Gaussian distributed with zero mean). It makes sense to assume that
$W_{ij}$ and $X_i$ are independent. Also, the bias $b_j$ is a constant (usually
initialized to zero). Using the Appendix formula (D.4.1), the variance of $Y_j$
is approximately given by

$$\begin{aligned}
\mathrm{Var}(Y_j) &\approx \phi'\Big( \sum_i \mathbb{E}[W_{ij} X_i] + b_j \Big)^2\, \mathrm{Var}\Big( \sum_i W_{ij} X_i \Big) \\
&= \phi'(b_j)^2 \sum_i \mathrm{Var}(W_{ij} X_i) \\
&= \phi'(b_j)^2 \sum_i \mathrm{Var}(W_{ij})\, \mathrm{Var}(X_i) \\
&= N \phi'(0)^2\, \mathrm{Var}(W_{ij})\, \mathrm{Var}(X_i),
\end{aligned}$$

where in the second identity we used
$\mathbb{E}[W_{ij} X_i] = \mathbb{E}[W_{ij}]\, \mathbb{E}[X_i] = 0$ and the
additivity of variance with respect to independent random variables. In the
third identity we used the multiplicative property of variance as given by
Goodman's formula, see Lemma D.4.1 from Appendix. In the last identity
we used that both weights $W_{ij}$ and inputs $X_i$ are identically distributed
and the number of incoming neurons is $N$. We also initialized here the bias
$b_j = 0$.

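Asking the condition that the variance is invariant under the $\ell$th layer,
i.e., $\mathrm{Var}(Y_j) = \mathrm{Var}(X_i)$, we obtain

$$N \phi'(0)^2\, \mathrm{Var}(W_{ij})\, \mathrm{Var}(X_i) = \mathrm{Var}(X_i).$$

Simplifying by $\mathrm{Var}(X_i)$, we obtain the variance of the weights

$$\mathrm{Var}(W_{ij}) = \frac{1}{N \phi'(0)^2}. \tag{6.3.36}$$

We encounter two useful cases:

1. If $W_{ij}$ are drawn from a normal distribution, then
$W_{ij} \sim \mathcal{N}\big( 0, \frac{1}{N \phi'(0)^2} \big)$. In the case of a
linear activation, $\phi'(0) = 1$, we have
$W_{ij} \sim \mathcal{N}\big( 0, \frac{1}{N} \big)$.

2. If $W_{ij}$ are drawn from a uniform distribution on $[-a, a]$, with zero
mean, $W_{ij} \sim \mathrm{Unif}[-a, a]$, then equating their variances,

$$\frac{a^2}{3} = \frac{1}{N \phi'(0)^2},$$

yields $a = \frac{\sqrt{3}}{\phi'(0) \sqrt{N}}$. Hence,
$W_{ij} \sim \mathrm{Unif}\Big[ -\frac{\sqrt{3}}{\phi'(0) \sqrt{N}}, \frac{\sqrt{3}}{\phi'(0) \sqrt{N}} \Big]$.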

It is worth noting how the slope of the activation function at 0 influences
the range of the initialization interval. The larger the slope at zero, the
narrower the interval. One way to use this for obtaining better results is to
set $\phi'(0)$ to be a new hyperparameter and tune it on a validation set.
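The two cases can be sketched as small sampling routines. This is only an illustration under the assumptions of this section (zero-mean weights, zero biases); the function names and the parameter `phi_prime_0`, standing for $\phi'(0)$, are hypothetical.

```python
import math
import random

def weight_variance(n_in, phi_prime_0=1.0):
    """Target variance 1 / (N * phi'(0)^2) for N incoming neurons."""
    return 1.0 / (n_in * phi_prime_0 ** 2)

def uniform_bound(n_in, phi_prime_0=1.0):
    """Half-width a = sqrt(3) / (phi'(0) * sqrt(N)) of the interval [-a, a]."""
    return math.sqrt(3.0) / (phi_prime_0 * math.sqrt(n_in))

def sample_normal(n_in, n_out, phi_prime_0=1.0, rng=None):
    """n_in x n_out weights drawn from a zero-mean normal distribution."""
    rng = rng or random.Random(0)
    std = math.sqrt(weight_variance(n_in, phi_prime_0))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

def sample_uniform(n_in, n_out, phi_prime_0=1.0, rng=None):
    """n_in x n_out weights drawn uniformly from [-a, a]."""
    rng = rng or random.Random(0)
    a = uniform_bound(n_in, phi_prime_0)
    return [[rng.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

# Linear activation: phi'(0) = 1.  Logistic activation: phi'(0) = 1/4.
print(weight_variance(100), uniform_bound(100))
print(weight_variance(100, 0.25), uniform_bound(100, 0.25))
```

For the logistic function the slope at zero is $1/4$, so both the target variance and the interval are wider than in the linear case.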

Xavier initialization These weight initializations have taken into account
only the number of the input neurons, $N$, into the layer $\ell$, using a
forward propagation point of view described by the invariance relation
$\mathrm{Var}(X^{(\ell)}) = \mathrm{Var}(X^{(\ell+1)})$. For preserving the
backpropagated signal as well, Glorot and Bengio [43] worked out a formula
involving also the outgoing number of neurons, called the Xavier
initialization. This involves a backpropagation point of view, which is based
on the assumption that the variances of the cost function gradient remain
unchanged as the signal backpropagates through the $\ell$th layer, i.e.,

$$\mathrm{Var}\Big[ \frac{\partial C}{\partial W_{ij}^{(\ell-1)}} \Big] = \mathrm{Var}\Big[ \frac{\partial C}{\partial W_{ij}^{(\ell)}} \Big].$$

Using the chain rule equation (6.2.27) the previous equation becomes

$$\mathrm{Var}\big[ \delta_j^{(\ell-1)} x_i^{(\ell-2)} \big] = \mathrm{Var}\big[ \delta_j^{(\ell)} x_i^{(\ell-1)} \big],$$


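where $\delta_j^{(\ell)} = \frac{\partial C}{\partial s_j^{(\ell)}}$. In order to
make progress with the computation we shall assume that the activation function
is linear, $\phi(x) = x$, i.e., the neuron is of linear type. Under this
condition the equation (6.2.26) is written as

$$\delta_i^{(\ell-1)} = \sum_{j=1}^{N'} \delta_j^{(\ell)} w_{ij}^{(\ell)} = \sum_{j=1}^{N'} \delta_j^{(\ell)} W_{ij}^{(\ell)}, \tag{6.3.37}$$

where $N' = d^{(\ell)}$ is the number of neurons in the $\ell$th layer. Since
$\delta_j^{(\ell)}$ and $W_{ij}^{(\ell)}$ are independent, see Exercise 6.6.7,
then we have

$$\mathbb{E}\Big[ \sum_{j=1}^{N'} \delta_j^{(\ell)} W_{ij}^{(\ell)} \Big] = \sum_{j=1}^{N'} \mathbb{E}[\delta_j^{(\ell)}]\, \mathbb{E}[W_{ij}^{(\ell)}] = 0,$$

where we used that $\mathbb{E}[W_{ij}^{(\ell)}] = 0$. Therefore,
$\mathbb{E}[\delta_i^{(\ell-1)}] = 0$, and similarly,
$\mathbb{E}[\delta_j^{(\ell)}] = 0$. Assuming that
$\mathbb{E}[x_i^{(\ell-2)}] = \mathbb{E}[x_i^{(\ell-1)}] = 0$, and using the
independence of $\delta_j^{(\ell)}$ and $x_i^{(\ell-1)}$, see Exercise 6.6.7,
then Goodman's formula (Lemma D.4.1) yields

$$\mathrm{Var}[\delta_j^{(\ell-1)}]\, \mathrm{Var}[x_i^{(\ell-2)}] = \mathrm{Var}[\delta_j^{(\ell)}]\, \mathrm{Var}[x_i^{(\ell-1)}].$$

Using the forward propagation relation
$\mathrm{Var}[x_i^{(\ell-2)}] = \mathrm{Var}[x_i^{(\ell-1)}]$, it follows that
the deltas variance remains unchanged

$$\mathrm{Var}[\delta_j^{(\ell-1)}] = \mathrm{Var}[\delta_j^{(\ell)}].$$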

Applying the variance to equation (6.3.37), using the independence and
Goodman's formula, we obtain

$$\mathrm{Var}(\delta_i^{(\ell-1)}) = \sum_{j=1}^{N'} \mathrm{Var}(\delta_j^{(\ell)})\, \mathrm{Var}(W_{ij}^{(\ell)}) = N'\, \mathrm{Var}(\delta_j^{(\ell)})\, \mathrm{Var}(W_{ij}^{(\ell)}),$$

where we also used the fact that the weights and deltas are identically
distributed. Dividing the last two equations, we arrive at the relation
$N'\, \mathrm{Var}(W_{ij}^{(\ell)}) = 1$, or equivalently

$$\mathrm{Var}(W_{ij}^{(\ell)}) = \frac{1}{N'}, \tag{6.3.38}$$

i.e., the variance of the weights in the $\ell$th layer is inversely
proportional to the number of neurons in that layer. Under the assumption of
the linear activation function, formula (6.3.36) becomes

$$\mathrm{Var}(W_{ij}^{(\ell)}) = \frac{1}{N}, \tag{6.3.39}$$

where $N$ is the number of neurons in the $(\ell - 1)$th layer.

Now, the equations (6.3.38) and (6.3.39) are satisfied simultaneously only
in the case $N = N'$, i.e., when the number of neurons is the same in any two
consecutive layers. Since this condition is too restrictive, a convenient
compromise is to replace the number of neurons by the arithmetic average of
$N$ and $N'$, the case in which

$$\mathrm{Var}(W_{ij}^{(\ell)}) = \frac{2}{N + N'}, \tag{6.3.40}$$

where $N = d^{(\ell-1)}$ and $N' = d^{(\ell)}$.

Again, we emphasize two cases of practical importance:

1. If $W_{ij}$ are drawn from a normal distribution, then
$W_{ij} \sim \mathcal{N}\big( 0, \frac{2}{N + N'} \big)$.

2. If $W_{ij}$ are drawn from a uniform distribution with zero mean, then
$W_{ij} \sim \mathrm{Unif}\Big[ -\frac{\sqrt{6}}{\sqrt{N + N'}}, \frac{\sqrt{6}}{\sqrt{N + N'}} \Big]$,
which is also known as the normalized initialization.
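The purpose of this choice can be illustrated on a single linear layer. In the sketch below, which uses hypothetical function names and zero biases, weights are drawn with variance $2/(N + N')$ and a standard normal input is pushed through a square layer ($N = N'$); the empirical variance of the outputs should stay close to the input variance 1.

```python
import math
import random

def xavier_variance(n_in, n_out):
    """Weight variance 2 / (N + N')."""
    return 2.0 / (n_in + n_out)

def xavier_normal(n_in, n_out, rng):
    """n_in x n_out weights drawn from N(0, 2 / (N + N'))."""
    std = math.sqrt(xavier_variance(n_in, n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

def xavier_uniform(n_in, n_out, rng):
    """Normalized initialization on [-sqrt(6/(N+N')), sqrt(6/(N+N'))]."""
    a = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    return [[rng.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

def linear_layer_output_variance(n, seed=0):
    """Empirical output variance of a linear n -> n layer fed with N(0,1) inputs."""
    rng = random.Random(seed)
    W = xavier_normal(n, n, rng)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [sum(W[i][j] * x[i] for i in range(n)) for j in range(n)]
    mean = sum(y) / n
    return sum((v - mean) ** 2 for v in y) / n

print(xavier_variance(50, 50))
print(linear_layer_output_variance(400))
```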


In conclusion, a reasonable initialization of weights, even if it might not
totally solve the vanishing and exploding gradient problems, still improves
significantly the backpropagation algorithm functionality.


6.4 Strong and Weak Priors 

In the context of neural networks, a prior is a probability distribution over 
the parameters (weights and biasses) that encodes our initial beliefs before 
we have seen the data. 

A prior is called weak if its entropy is high, i.e., if the initial belief about 
the parameters distribution is not very specific. A prior is called strong if its 
entropy is small, i.e., if the initial belief about parameters distribution is very 
specific. 













This section will discuss the previous initializations in the aforementioned
context. First, we recall from Chapter 12 that the entropy of a probability
distribution $p$ is given by

$$H(p) = -\int p(x) \ln p(x)\, dx.$$

We shall consider only priors $p$ with positive entropy.

(i) In the case when
$W_{ij} \sim \mathrm{Unif}\Big[ -\frac{\sqrt{3}}{\phi'(0) \sqrt{N}}, \frac{\sqrt{3}}{\phi'(0) \sqrt{N}} \Big]$,
then by Exercise 6.6.10 we have the entropy

$$H(p_w) = \ln \frac{2\sqrt{3}}{\phi'(0) \sqrt{N}},$$

where $p_w$ denotes the probability density of the weights $W_{ij}$.
Consequently, the prior becomes strong if the number of neurons $N$ is large.

(ii) In the case of Xavier initialization,
$W_{ij} \sim \mathcal{N}\big( 0, \frac{2}{N + N'} \big)$. Then by
Exercise 6.6.11 we have

$$H(p_w) = \ln \frac{2\sqrt{\pi e}}{\sqrt{N + N'}}.$$

Hence, a large number of incoming and outgoing neurons makes the prior
strong.
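Both entropy formulas are easy to evaluate, which shows how quickly the prior sharpens as layers get wider. The sketch below is a direct transcription of the two expressions; the function names and the parameter `phi_prime_0`, standing for $\phi'(0)$, are choices made here.

```python
import math

def entropy_uniform_prior(n_in, phi_prime_0=1.0):
    """Entropy of Unif[-a, a] with a = sqrt(3) / (phi'(0) * sqrt(N))."""
    return math.log(2.0 * math.sqrt(3.0) / (phi_prime_0 * math.sqrt(n_in)))

def entropy_xavier_prior(n_in, n_out):
    """Entropy of the normal prior N(0, 2 / (N + N'))."""
    return math.log(2.0 * math.sqrt(math.pi * math.e) / math.sqrt(n_in + n_out))

for n in (4, 16, 64, 256):
    print(n, entropy_uniform_prior(n), entropy_xavier_prior(n, n))
```

In both cases the entropy decreases as the number of neurons grows, i.e., the prior becomes stronger.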


6.5 Summary 

A neural network is obtained by a concatenation of several layers of neurons.
The first layer corresponds to the input while the last one to the output. The
inner layers are called hidden layers. This chapter was concerned only with
the feedforward neural networks. These are networks where the information
flows forward, from the input to the output.

Neural networks are used for learning complicated features, which a single
neuron can't do. They are trained by the gradient descent method. An
important ingredient of this method is to compute the gradient of the cost
function. The gradient is computed by a recursive method called
backpropagation. The master equations for both forward and backward propagation
are provided in detail.

Sometimes it is more convenient to compute the gradient using an average
over a training minibatch, which leads to the stochastic gradient descent
method. The optimal size of the minibatch depends on the problem and can
be considered as a trade-off between the training time and obtained accuracy.














Figure 6.7: a. Neural network used in Exercise 6.6.3. b. For Exercise 6.6.6. 


For the gradient descent method to work properly, the weights and biases have
to be initialized appropriately. If the network is shallow, usually
a zero initialization is preferred. For deep networks there are some preferred
initializations such as Xavier initialization, or the normalized initialization.


6.6 Exercises 

Exercise 6.6.1 Draw the multi-perceptron neural network given by Remark 6.1.2.

Exercise 6.6.2 (a) Can a single perceptron learn the following mapping:

(0,0,0) → 1, (1,0,0) → 0, (0,1,0) → 0, (0,0,1) → 0,

(0,1,1) → 1, (1,1,0) → 0, (1,0,1) → 0, (1,1,1) → 1?

(b) Draw a multi-perceptron neural network that learns the previous mapping.
State the weights and biases of all perceptrons in the network.


Exercise 6.6.3 Write an explicit formula for the output of the multi-perceptron
neural network given in Fig. 6.7 a.

Exercise 6.6.4 Consider a sigmoid neuron with one-dimensional input $x$,
weight $w$, bias $b$, and output $y = \sigma(wx + b)$. The target is the
one-dimensional variable $z$. Consider the cost function
$C(w, b) = \frac{1}{2}(y - z)^2$;

(a) Find $\nabla C(w, b)$ and show that
$\| \nabla C \| \le \frac{1}{4} \sqrt{1 + x^2}\, (1 + |z|)$;

(b) Write the gradient descent iteration for the sequence of weights
$(w_n, b_n)$.










Exercise 6.6.5 (noise reduction) Consider the neural network given by
Fig. 6.2.

(a) Show that $\forall \epsilon > 0$, there is $\eta > 0$ such that if
$\| dX \| < \eta$ then $\| dY \| < \epsilon$;

(b) Assume the input $X$ is noisy. Discuss the effect of weights regularization
on noise removal;

(c) Discuss the choice of the activation function in the noise removal. Which
choice will denoise better: the logistic sigmoid or the hyperbolic tangent?


Exercise 6.6.6 Consider the neural network given by Fig. 6.7 b, with the
cost function $C(w, b) = \frac{1}{2}(y - z)^2$.

(a) Compute the deltas $\delta^{(i)} = \frac{\partial C}{\partial s^{(i)}}$,
$i \in \{1, 2\}$;

(b) Use backpropagation to find the gradient $\nabla C(w, b)$.

Exercise 6.6.7 Assume the activation function in the $\ell$th layer is linear
and the weights and inputs are random variables.

(a) Show that $\delta_j^{(\ell)}$ and $W_{ij}^{(\ell)}$ are independent random
variables;

(b) Show that $\delta_j^{(\ell)}$ and $x_i^{(\ell-1)}$ are independent random
variables.

Exercise 6.6.8 Consider a sigmoid neuron with a random input normally
distributed, $X \sim \mathcal{N}(0, 1)$, and the output $Y = \sigma(wX + b)$.
Show that $\mathrm{Var}(Y) \approx w^2 \sigma'(b)^2$. Note that the output
variance decreases for small weights, $w$, or large values of the bias, $b$.


Exercise 6.6.9 Consider a one-hidden layer neural network with sigmoid
neurons in the hidden layer. Given that the input is normally distributed,
$X \sim \mathcal{N}(0, 1)$, and the output is
$Y = \sum_i \alpha_i \sigma(w_i X + b_i)$, show that $\mathrm{Var}(Y) \approx$




Exercise 6.6.10 Let $p(x)$ be the uniform distribution over the interval
$[a, b]$. Show that $H(p) = \ln(b - a)$.


Exercise 6.6.11 Let $p(x)$ be the one-dimensional normal distribution with
mean $\mu$ and standard deviation $\sigma$. Show that its entropy is
$H(p) = \ln(\sigma \sqrt{2 \pi e})$.


Exercise 6.6.12 State a formula relating the batch size, $N$, the current
iteration, $k$, and the number of epochs, $p$, during a neural network training.







Part II 
Analytic Theory 





Chapter 7 

Approximation Theorems 


This chapter presents a few classical real analysis results with applications to
learning continuous, integrable, or square-integrable functions. The
approximation results included in this chapter contain Dini's theorem,
Arzela-Ascoli's theorem, Stone-Weierstrass theorem, Wiener's Tauberian theorem,
and the contraction principle. Some of their applications to learning will be
provided within this chapter, while others will be given in later chapters.


7.1 Dini’s Theorem 


Learning a continuous function on a compact interval reduces to showing
that the neural network produces a sequence of continuous functions, which
converges uniformly to the continuous target function. In most cases, it is not
difficult to construct outcomes that converge pointwise to the target function.
However, showing the uniform convergence can be sometimes a complicated
task. In some circumstances, this can be simplified by the use of certain
approximation results, such as the following.


Theorem 7.1.1 (Dini) Let $f_n : [a, b] \to \mathbb{R}$ be a sequence of
continuous functions.

(i) If $f_{n+1} \le f_n$, for all $n \ge 1$, and $f_n(x) \to 0$, for any
$x \in [a, b]$, then $f_n$ converges uniformly to 0 on $[a, b]$.

(ii) Let $g \in C[a, b]$ such that $f_n(x) \searrow g(x)$ for any
$x \in [a, b]$. Then $f_n$ converges uniformly to $g$ on $[a, b]$.

Proof: (i) Let $\epsilon > 0$ be arbitrarily fixed and consider the sets

$$A_n = f_n^{-1}([\epsilon, \infty)) = \{x \in [a, b];\ f_n(x) \ge \epsilon\},$$

© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_7 















Figure 7.1: The descending sequence of compact sets $A_n$.


see Fig. 7.1. Since the $f_n$ are continuous functions, the sets $A_n$ are closed. Furthermore,
the sets $A_n$ are compact, as closed subsets of a compact set, $A_n \subset [a, b]$.
The sequence $(A_n)_n$ is descending, i.e., $A_{n+1} \subset A_n$, $\forall n \ge 1$. To show this,
let $x \in A_{n+1}$ be arbitrarily chosen. Since $f_n(x) \ge f_{n+1}(x) \ge \epsilon$, then $x \in A_n$,
which implies the desired inclusion.

Assume now that $A_n \ne \emptyset$, for all $n \ge 1$. Then $\bigcap_{n \ge 1} A_n \ne \emptyset$, as an
intersection of a descending sequence of nonempty compact sets.¹ Let $x_0 \in \bigcap_{n \ge 1} A_n$.
Then $f_n(x_0) \ge \epsilon$, for all $n \ge 1$, which contradicts the fact that
$f_n(x_0) \to 0$, as $n \to \infty$. Hence, the assumption made is false, that is, there is
an $N \ge 1$ such that $A_N = \emptyset$. This means $f_N(x) < \epsilon$, $\forall x \in [a, b]$. Using
that $(f_n)_n$ is decreasing, it follows that $f_n(x) < \epsilon$, $\forall n \ge N$ and $\forall x \in [a, b]$.
Since the sequence decreases pointwise to $0$, we also have $f_n \ge 0$, so this
can be written as $\max_{x \in [a,b]} |f_n(x)| \to 0$, as $n \to \infty$, namely, $f_n$ converges
uniformly to $0$.


(ii) It follows from part (i) applied to the sequence $(f_n - g)_n$.


Remark 7.1.2 The result still holds if the compact interval $[a, b]$ is replaced
with a compact complete metric space $(S, d)$, and the convergence is considered
in the sense of the metric $d$.


Example 7.1.1 Consider a continuous function $g : [0,1] \to \mathbb{R}$ and the
partition $\Delta_1 : 0 = x_0 < x_1 < \cdots < x_n = 1$, such that $(x_i)_i$ are inflection
points for $g$, that is, $g|_{[x_i, x_{i+1}]}$ is either convex or concave, see Fig. 7.2. Denote
by $f_1$ the piecewise linear function obtained by joining the points $(x_i, g(x_i))$.


¹ This result is known as Cantor's lemma.




















Figure 7.2: A polygonal approximating sequence $(f_n)_n$ satisfying the inequality
$|g(x) - f_{n+1}(x)| \le |g(x) - f_n(x)|$, for all $x \in [0,1]$.


Now, consider the partition $\Delta_2$ obtained by enriching the partition $\Delta_1$ with
the segment midpoints $(x_i + x_{i+1})/2$. Denote the newly obtained piecewise
linear function by $f_2$. We continue the procedure, associating the function $f_n$
with the partition $\Delta_n$. The sequence of continuous functions $(f_n)_n$ satisfies the
property

$$|g(x) - f_{n+1}(x)| \le |g(x) - f_n(x)|, \qquad \forall x \in [0, 1].$$

Since the mesh of $\Delta_n$ tends to zero and $g$ is continuous, $|g - f_n| \to 0$ pointwise.
Dini's Theorem applied to the sequence $|g - f_n|$ then implies that $f_n$ converges to
$g$, uniformly on $[0,1]$.
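The midpoint refinement of Example 7.1.1 can be tried numerically. The sketch below is only an illustration under assumptions that are not in the text: the target is taken to be $g(x) = \sin(\pi x)$, which is concave on $[0,1]$, so the initial partition $\{0, 1\}$ is admissible, and the uniform error is estimated on a finite grid rather than computed exactly. The helper names `interp` and `sup_errors` are hypothetical.

```python
import math

def g(x):
    # Assumed target: concave on [0, 1], so [0, 1] itself is an admissible first segment.
    return math.sin(math.pi * x)

def interp(nodes, x):
    """Piecewise linear interpolant of g through the sorted nodes, evaluated at x."""
    for left, right in zip(nodes, nodes[1:]):
        if left <= x <= right:
            t = (x - left) / (right - left)
            return (1 - t) * g(left) + t * g(right)
    raise ValueError("x lies outside the partition")

def sup_errors(levels=6, grid_size=1025):
    """Grid estimate of max |g - f_n| for the partitions Delta_1, ..., Delta_levels."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    nodes = [0.0, 1.0]                      # partition Delta_1
    errors = []
    for _ in range(levels):
        errors.append(max(abs(g(x) - interp(nodes, x)) for x in grid))
        mids = [(a + b) / 2 for a, b in zip(nodes, nodes[1:])]
        nodes = sorted(nodes + mids)        # enrich with the segment midpoints
    return errors

errors = sup_errors()
for n, err in enumerate(errors, start=1):
    print(f"n = {n}: estimated sup error = {err:.6f}")
```

The printed errors shrink at every refinement, which is the behavior Dini's theorem guarantees once monotonicity and pointwise convergence are known.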


7.2 Arzela-Ascoli’s Theorem 

Another way of constructing a uniformly convergent sequence is to extract
from an existing family of continuous functions a subsequence with this property.
The Arzela-Ascoli theorem provides an existence result, rather than a
constructive procedure, which is useful in theoretical investigations. First, we
shall introduce a few definitions.

A family of functions $\mathcal{F}$ on a set $A$ is called uniformly bounded if there is
an $M > 0$ such that

$$|f(x)| \le M, \qquad \forall x \in A,\ \forall f \in \mathcal{F}.$$

This means that the functions in the family $\mathcal{F}$ are all bounded by the same
bound $M$.














Example 7.2.1 Let $\mathcal{F} = \{\cos(ax + b);\ a, b \in \mathbb{R}\}$. Then the family $\mathcal{F}$ is
uniformly bounded, with $M = 1$.


Example 7.2.2 Consider

$$\mathcal{F} = \Big\{ \sum_{j=1}^{N} \alpha_j \sigma(w_j x + b_j);\ \alpha_j, w_j, b_j \in \mathbb{R},\ \sum_{j=1}^{N} \alpha_j^2 \le 1 \Big\},$$

where $\sigma(x)$ is a sigmoid function satisfying $|\sigma(x)| \le 1$ (such as the logistic
function or the hyperbolic tangent). Then the family $\mathcal{F}$ is uniformly bounded,
with $M = \sqrt{N}$. This follows from an application of Cauchy's inequality

$$\Big( \sum_{j=1}^{N} \alpha_j \sigma(w_j x + b_j) \Big)^2 \le \sum_{j=1}^{N} \alpha_j^2 \, \sum_{j=1}^{N} \sigma(w_j x + b_j)^2 \le 1 \cdot N = N.$$

A family of functions $\mathcal{F}$ on a set $A$ is called equicontinuous if $\forall \epsilon > 0$,
there is $\eta > 0$ such that $\forall f \in \mathcal{F}$ we have

$$|f(x) - f(x')| < \epsilon, \qquad \forall x, x' \in A, \text{ with } |x - x'| < \eta.$$

Equivalently stated, the functions in the family $\mathcal{F}$ are all uniformly continuous
with the same $\epsilon$ and $\eta$.

Example 7.2.3 Consider a family of continuously differentiable functions $\mathcal{F} \subset C^1[a, b]$,
such that there is a constant $L > 0$ with the property

$$\sup_{x \in [a,b]} |f'(x)| < L, \qquad \forall f \in \mathcal{F}.$$

Then $\mathcal{F}$ is equicontinuous.

To show this, using the Mean Value Theorem, we obtain the following estimation:

$$|f(x) - f(x')| \le \sup_{x \in [a,b]} |f'(x)| \, |x - x'| < L\, |x - x'|, \qquad \forall f \in \mathcal{F},$$

and then choose $\eta = \epsilon/L$.


Example 7.2.4 Consider the family of functions

$$\mathcal{F} = \Big\{ \sum_{j=1}^{N} \alpha_j \sigma(w_j x + b_j);\ \alpha_j, w_j, b_j \in \mathbb{R},\ \sum_{j=1}^{N} (\alpha_j^2 + w_j^2) \le 1 \Big\},$$


where $\sigma$ stands for the logistic sigmoid. The family can be interpreted as outputs
of a one-hidden layer neural network with exactly $N$ units in the hidden
layer. The weights constraint $\|\alpha\|^2 + \|w\|^2 \le 1$ states that the weights are
kept small, which is one of the ideas of regularization. Under this hypothesis,
the family $\mathcal{F}$ is equicontinuous.

This can be shown as follows. The maximum slope of $\sigma(x)$ is
achieved at $x = 0$ and it is equal to

$$\max \sigma'(x) = \max \sigma(x)(1 - \sigma(x)) = \sigma(0)(1 - \sigma(0)) = \frac{1}{4}.$$

Then for any $f \in \mathcal{F}$, applying Cauchy's inequality, we have the derivative
estimation

$$|f'(x)| = \Big| \sum_{j=1}^{N} \alpha_j w_j \sigma'(w_j x + b_j) \Big| \le \frac{1}{4} \sum_{j=1}^{N} |\alpha_j|\,|w_j| \le \frac{1}{4} \Big( \sum_{j=1}^{N} \alpha_j^2 \Big)^{1/2} \Big( \sum_{j=1}^{N} w_j^2 \Big)^{1/2} \le \frac{1}{4}.$$

Since $\sup_{x \in [a,b]} |f'(x)| \le \frac{1}{4}$, $\forall f \in \mathcal{F}$, using Example 7.2.3, it follows that the
family $\mathcal{F}$ is equicontinuous on $[a, b]$. It is worth noting that the conclusion
holds in the more general setup of a sigmoid function $\sigma$ satisfying $|\sigma'(x)| < \lambda < 1$.
For instance, $\sigma(x) = \tanh(cx)$, or $\sigma(x) = \sin(cx + b)$, with $|c| < \lambda < 1$
and fixed $b \in \mathbb{R}$, satisfy the condition.
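The two bounds of Examples 7.2.2 and 7.2.4 can be checked numerically on randomly drawn networks. This is a minimal sketch, not part of the original text: the sampling scheme, the input range $[-3, 3]$, the bias range, and the helper names (`random_network`, `steepest_chord`) are all assumptions made for illustration. By the Mean Value Theorem every difference quotient is bounded by $\sup |f'|$, so it should never exceed $1/4$.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def random_network(n_hidden, rng):
    """Sample (alpha, w, b) with sum_j (alpha_j^2 + w_j^2) <= 1; biases are free."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(2 * n_hidden)]
    norm = math.sqrt(sum(v * v for v in raw))
    scale = rng.random() / norm            # joint norm of (alpha, w) lies in [0, 1)
    alpha = [v * scale for v in raw[:n_hidden]]
    w = [v * scale for v in raw[n_hidden:]]
    b = [rng.uniform(-5.0, 5.0) for _ in range(n_hidden)]
    return alpha, w, b

def output(x, alpha, w, b):
    return sum(a * sigmoid(wj * x + bj) for a, wj, bj in zip(alpha, w, b))

def steepest_chord(alpha, w, b, lo=-3.0, hi=3.0, steps=300):
    """Largest difference quotient |f(x') - f(x)| / |x' - x| over a uniform grid."""
    h = (hi - lo) / steps
    values = [output(lo + i * h, alpha, w, b) for i in range(steps + 1)]
    return max(abs(v2 - v1) / h for v1, v2 in zip(values, values[1:]))

rng = random.Random(0)
N = 5
networks = [random_network(N, rng) for _ in range(200)]
worst_slope = max(steepest_chord(*net) for net in networks)
worst_value = max(abs(output(x, *net)) for net in networks for x in (-3.0, 0.0, 3.0))
print("largest observed slope:", worst_slope, "(bound 1/4)")
print("largest observed |f|:  ", worst_value, "(bound sqrt(N))")
```

A common slope bound for the whole family is exactly what equicontinuity asks for, and the bound does not involve the biases, which are left unconstrained in the sampling.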


The proof of the following important result can be found in any advanced
book of Real Analysis, such as [104].

Theorem 7.2.1 (Arzela-Ascoli) Let $\mathcal{F} \subset C[a,b]$ be a family of continuous
functions on $[a, b]$. Then the following statements are equivalent:

(a) Any sequence $(f_n)_n \subset \mathcal{F}$ contains a subsequence $(f_{n_k})_k$ that is uniformly
convergent;

(b) The family $\mathcal{F}$ is equicontinuous and uniformly bounded.


We make the remark that condition (a) is equivalent to the fact that $\mathcal{F}$
is a relatively compact set (that is, a set with compact closure) in the metric
space $(C[a, b], \| \cdot \|_\infty)$, see Appendix A for definitions.

In the following we state an existence result for unsupervised learning. A
three-layer neural net is capable of "potentially learning" some continuous
function, given constraints on the size of the hidden layer, the slope of
the activation function, and the magnitude of the weights.


Theorem 7.2.2 Let $N \ge 1$ be a fixed integer and consider a one-hidden layer
neural network such that:

(i) the network input is a real bounded variable, $x \in [a, b]$;

(ii) the output is a one-dimensional neuron with linear activation function
and zero bias;

(iii) there are $N$ neurons in the hidden layer with a differentiable activation
function $\sigma$ such that $|\sigma'| < \lambda < 1$;

(iv) the weights satisfy the regularization condition $\sum_{j=1}^{N} (\alpha_j^2 + w_j^2) \le 1$, where
$w_j$ are the input to hidden layer weights and $\alpha_j$ are the hidden layer to output
weights.

Then there is a continuous function $g$ on $[a, b]$ that can be approximately
represented by the network.


Proof: The output function of the network is given by the sum

$$f_{\alpha, w, b}(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j x + b_j),$$

where $w_j$ are the weights from the input to the hidden layer, and $\alpha_j$ are
the weights from the hidden layer to the output. The biases are denoted
by $b_j$. Given the regularization constraint satisfied by the weights, Example 7.2.2
implies that the family $\mathcal{F} = \{f_{\alpha, w, b};\ \alpha, w, b \in \mathbb{R}^N\}$ is uniformly
bounded. Example 7.2.4 shows that $\mathcal{F}$ is equicontinuous. Then applying the
Arzela-Ascoli Theorem, it follows that any sequence of outcomes, $(f_n)_n \subset \mathcal{F}$,
given by

$$f_n(x) = \sum_{j=1}^{N} \alpha_j(n)\, \sigma\big(w_j(n) x + b_j(n)\big), \qquad n \ge 1,$$

has a subsequence $(f_{n_k})_k$, which is uniformly convergent on $[a, b]$; its limit is
a continuous function, $g \in C[a, b]$.


Remark 7.2.3 In the previous result the continuous function $g$ is not known
a priori. The neural network learns "potentially" a continuous function, in the
sense that from any infinite sequence of network trials one can extract
a subsequence along which the network approximates some continuous function.
This is an existence result; finding an algorithm for constructing the
convergent subsequence is a different problem and will be treated in a different
chapter.


Example 7.2.5 (Sigmoid neuron) Consider a sigmoid neuron with the
output

$$f_{w,b}(x) = \sigma(w^T x + b),$$

where $\sigma$ is the logistic function, with weights $w \in \mathbb{R}^n$, bias $b \in \mathbb{R}$, and input
$x \in I_n = [0,1] \times \cdots \times [0,1]$. Assume the weights are bounded, with $\|w\| \le 1$.
Then the family of continuous functions

$$\mathcal{F} = \{f_{w,b};\ \|w\| \le 1,\ b \in \mathbb{R}\}$$

is uniformly bounded (since $|f_{w,b}(x)| < 1$) and equicontinuous, because

$$|f_{w,b}(x') - f_{w,b}(x)| = |\sigma(w^T x' + b) - \sigma(w^T x + b)| \le \max |\sigma'| \, |w^T x' + b - w^T x - b| = \frac{1}{4} |w^T (x' - x)| \le \frac{1}{4} \|w\| \, \|x' - x\| \le \frac{1}{4} \|x' - x\|.$$

By the Arzela-Ascoli theorem, any sequence of outputs $\{f_{w_k, b_k}\}_k$ contains
a subsequence uniformly convergent to a continuous function $g$ on $I_n$. It is
worth remarking that no conditions have been imposed on the bias $b$.


7.3 Application to Neural Networks 


The previous notions can be applied to the study of the one-hidden layer 
neural networks with a continuum of neurons in the hidden layer, see also 
section 5.8. 


Assume the input belongs to a compact interval, $x \in [a, b]$. It is well known
that a one-hidden layer neural net, with $N$ neurons in the hidden layer, has
an output given by $g(x) = \sum_{j=1}^{N} \sigma(w_j x + b_j)\alpha_j$, where $\sigma$ is a logistic sigmoid.
Also, $w_j$ are the weights between the input and the $j$th node, and $\alpha_j$ is the
weight between the $j$th node and the output; $b_j$ denotes the bias for the $j$th
neuron in the hidden layer.


We want to generalize to the case where the number of hidden units
is infinite. In particular, we want to consider the case where the number
of hidden units is uncountably infinite. We assume the hidden neurons are
continuously parametrized by $t$, taking values in the compact interval $[c, d]$. Hence, the
parameter $t$ will replace the index $j$. Furthermore, the summation weights $\alpha_j$
will be replaced by a measure, which, under the hypothesis of being absolutely
continuous with respect to the Lebesgue measure, will take the form
$h(t)\,dt$. The output from the $j$th hidden neuron, $\sigma(w_j x + b_j)$, is a continuous
function of $x$ and depends on $j$. This will be transformed into the kernel
$K(x,t) = \sigma(w(t)x + b(t))$. The sum will become an integral with respect to
the aforementioned measure.


Hence, the continuum analog of the output $g(x) = \sum_{j=1}^{N} \sigma(w_j x + b_j)\alpha_j$
is the integral $g(x) = \int_c^d K(x,t)h(t)\, dt$. If $\mathcal{K}$ is the integral transform with
the kernel $K$, then $g = \mathcal{K}(h)$, i.e., the output is the integral transform of
the measure density $h$. Assuming the regularization condition $\|h\|_2 \le M$,
namely, the measure densities are $L^2$-uniformly bounded, the set of
outputs is equicontinuous and uniformly bounded, see Exercise 7.11.10. By the
Arzela-Ascoli Theorem, there is a sequence $(h_n)_n$ such that the sequence of
outputs, $g_n = \mathcal{K}(h_n)$, is uniformly convergent on $[0,1]$ to a continuous function
$g$. In Chapter 9, it will be shown that the neural network can learn exactly
any continuous function on $[0,1]$.

It is worth noting that we can consider an even more general case where
the number of hidden units can be finite, countably infinite, or uncountably
infinite. This can be accomplished by choosing the output
$g(x) = \int_c^d K(x,t)\, d\mu(t)$, where the weighting measure $\mu$ is taken as in section 5.8.
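The passage from a finite hidden layer to a continuum of neurons can be illustrated by discretizing the integral. The sketch below rests on hypothetical choices that do not come from the text: $[c, d] = [0, 1]$, $w(t) = t$, $b(t) = 0$ and $h(t) = 1$, for which the integral has the closed form $(\ln(1 + e^{x}) - \ln 2)/x$. A finite network with $N$ neurons is obtained from the midpoint rule, with weights $\alpha_j = h(t_j)\,\Delta t$.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def finite_network(x, n_hidden, c=0.0, d=1.0):
    """Midpoint-rule discretization: node t_j carries the weight alpha_j = h(t_j) * dt."""
    dt = (d - c) / n_hidden
    total = 0.0
    for j in range(n_hidden):
        t = c + (j + 0.5) * dt
        w_t, b_t, h_t = t, 0.0, 1.0      # hypothetical choices w(t) = t, b(t) = 0, h(t) = 1
        total += sigmoid(w_t * x + b_t) * h_t * dt
    return total

def continuum_network(x):
    """Closed form of the integral of sigmoid(t * x) over t in [0, 1]."""
    if x == 0.0:
        return 0.5
    return (math.log1p(math.exp(x)) - math.log(2.0)) / x

inputs = [-2.0, -0.5, 0.0, 1.0, 2.0]
gap_10 = max(abs(finite_network(x, 10) - continuum_network(x)) for x in inputs)
gap_100 = max(abs(finite_network(x, 100) - continuum_network(x)) for x in inputs)
print("max gap with 10 neurons: ", gap_10)
print("max gap with 100 neurons:", gap_100)
```

In this reading, a finite network is a quadrature rule for the continuum network, and adding neurons refines the quadrature.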



7.4 Stone-Weierstrass’ Theorem 


In this section we denote by $K$ a compact set in $\mathbb{R}^n$. For future applications,
it suffices to consider $K = I_n = [0,1] \times \cdots \times [0,1]$, the unit hypercube, and in
the case $n = 1$, we may consider $K = [a, b]$. The set of continuous real-valued
functions on $K$ is denoted by $C(K)$. The space $C(K)$ becomes a metric space
if we set the metric

$$d(f,g) = \max_{x \in K} |f(x) - g(x)|, \qquad \forall f, g \in C(K).$$

$C(K)$ can also be considered as a normed linear space with the norm given
by $\|f\| = \max_{x \in K} |f(x)|$.

We shall introduce next a central concept used in approximation theory
and learning. The reader can find the definitions of distance and metric space
in section E.1 of the Appendix.

Definition 7.4.1 Let $(S,d)$ be a metric space and $A$ a subset of $S$. Then $A$
is called dense in $S$ if for any element $g \in S$ there is a sequence $(f_n)_n$ of
elements in $A$ such that $\lim_{n \to \infty} d(f_n, g) = 0$.

Equivalently, this means that $\forall \epsilon > 0$, there is $N \ge 1$ such that

$$d(f_n, g) < \epsilon, \qquad \forall n \ge N.$$


Example 7.4.2 Let $(S,d) = (\mathbb{R}, |\cdot|)$ be the set of real numbers endowed
with the distance given by the absolute value and $A = \mathbb{Q} = \{\frac{p}{q};\ p, q \text{ integers},\ q \ne 0\}$,
the set of rational numbers. Then $\mathbb{Q}$ is dense in $\mathbb{R}$, namely for any
$x \in \mathbb{R}$, there is a sequence of rational numbers, $(r_n)_n$, which converges to $x$.
Equivalently, any real number can be approximated by a rational number:
given $x \in \mathbb{R}$, then $\forall \epsilon > 0$ there is $r \in \mathbb{Q}$ such that $|x - r| < \epsilon$.
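The density of the rationals can be made concrete with a few lines of code. The sketch below uses the elementary choice $q = \lceil 1/\epsilon \rceil$ and $p$ the integer nearest to $xq$, which gives $|x - p/q| \le 1/(2q) < \epsilon$. The function name `rational_within` is hypothetical, and the input is a floating-point number, so it stands in for a real number only up to machine precision.

```python
import math
from fractions import Fraction

def rational_within(x, eps):
    """Return a rational r = p/q with |x - r| < eps, using q = ceil(1/eps)."""
    q = math.ceil(1.0 / eps)
    p = round(x * q)          # nearest integer to x*q, so |x*q - p| <= 1/2
    return Fraction(p, q)

r_coarse = rational_within(math.sqrt(2.0), 1e-2)
r_fine = rational_within(math.sqrt(2.0), 1e-6)
print(r_coarse, float(r_coarse))
print(r_fine, float(r_fine))
```

Letting $\epsilon$ run through $1/n$ produces a sequence of rationals converging to $x$, as in Definition 7.4.1.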









Example 7.4.3 Consider $S = C(K)$ endowed with the distance $d(f,g) = \max_{x \in K} |f(x) - g(x)|$.
Then a subset $A$ is dense in $C(K)$ if for any $g \in C(K)$
and any $\epsilon > 0$, there is a function $f \in A$ such that

$$|g(x) - f(x)| < \epsilon, \qquad \forall x \in K.$$


The topological idea of density is that the elements of the subset $A$ come
arbitrarily close to every element of $S$, including those of the complementary set $S \setminus A$.
Even if this concept is somewhat counterintuitive in the context of general spaces, the
reader can think of it as in the familiar case of rational and real numbers.

A subset $A$ of $C(K)$ is called an algebra if $A$ is closed with respect to
linear combinations and multiplication:

(i) $\forall f, g \in A \Rightarrow f + g \in A$;

(ii) $\forall f \in A$, $\forall c \in \mathbb{R} \Rightarrow cf \in A$;

(iii) $\forall f, g \in A \Rightarrow fg \in A$.

We note that the first two conditions state that $A$ is a linear subspace of
$C(K)$.


Example 7.4.1 Let $A$ be the set of all finite Fourier series on $[0, 2\pi]$

$$A = \Big\{ f(x) = c_0 + \sum_{k=1}^{N} (a_k \cos kx + b_k \sin kx);\ a_k, b_k \in \mathbb{R},\ N = 0, 1, 2, \dots \Big\}.$$

It is obvious that $A$ is closed with respect to linear combinations. Using
formulas for transforming products into sums, such as
$\cos mx \cos nx = \frac{1}{2}[\cos(m + n)x + \cos(m - n)x]$, it follows that $A$ is also closed under products.
Hence, $A$ is an algebra of $C[0, 2\pi]$.


The algebra $A$ is said to separate points in $K$ if for any distinct $x, y \in K$,
there is an $f$ in $A$ such that $f(x) \ne f(y)$.

We say that the algebra $A$ contains the constant functions if for any
$c \in \mathbb{R}$, we have $c \in A$. Given the algebra properties of $A$, it suffices to
show that $1 \in A$.


Example 7.4.2 Let $A$ be the set of all polynomials defined on $[a, b]$

$$A = \Big\{ f(x) = \sum_{k=0}^{n} c_k x^k;\ x \in [a, b],\ c_k \in \mathbb{R},\ n = 0, 1, 2, \dots \Big\}.$$

Then $A$ is an algebra of $C[a, b]$ that separates points. To see this, let $x, y \in [a, b]$
be distinct and choose $f$ to be the identity polynomial. Then $f(x) = x \ne y = f(y)$.
Since $1 \in A$, it follows that $A$ contains the constant functions, too.





The proof of the following approximation theorem can be found, for 
instance, in [104]. 

Theorem 7.4.4 (Stone-Weierstrass) Let $K$ be a compact set in $\mathbb{R}^n$ and $A$
an algebra of continuous real-valued functions on $K$ satisfying the properties:

(i) $A$ separates the points of $K$;

(ii) $A$ contains the constant functions.

Then $A$ is a dense subset of $C(K)$.


Example 7.4.3 Let $A$ be the set of all polynomials defined on $[a, b]$

$$A = \Big\{ f(x) = \sum_{k=0}^{n} c_k x^k;\ x \in [a, b],\ c_k \in \mathbb{R},\ n = 0, 1, 2, \dots \Big\}.$$

From Example 7.4.2, $A$ is an algebra of $C[a, b]$ that separates points and contains
the constant functions. By the Stone-Weierstrass theorem, for any continuous
function $g : [a, b] \to \mathbb{R}$ and any $\epsilon > 0$, there is a polynomial function $f$ such
that $|g(x) - f(x)| < \epsilon$, $\forall x \in [a, b]$. If we consider a decreasing sequence $\epsilon_n \searrow 0$
and denote the associated polynomials by $f_n$, then the sequence $(f_n)_n$ tends
uniformly to $g$ on $[a, b]$. That is, $\max_{[a,b]} |f_n - g| \to 0$, as $n \to \infty$.
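The Stone-Weierstrass theorem only asserts that approximating polynomials exist. To see uniform polynomial approximation at work, the sketch below swaps in a different, constructive tool that the text does not use: the Bernstein polynomials $B_n g(x) = \sum_{k=0}^{n} g(k/n) \binom{n}{k} x^k (1-x)^{n-k}$ on $[0,1]$. The target $g(x) = |x - 1/2|$ and the evaluation grid are assumptions chosen for illustration.

```python
import math

def bernstein(g, n, x):
    """Value at x of the degree-n Bernstein polynomial of g on [0, 1]."""
    return sum(g(k / n) * math.comb(n, k) * x**k * (1.0 - x)**(n - k)
               for k in range(n + 1))

def sup_error(g, n, grid_size=201):
    """Grid estimate of max |g - B_n g| on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(g(x) - bernstein(g, n, x)) for x in grid)

def target(x):
    # Continuous but not differentiable at 1/2, so no Taylor expansion is available.
    return abs(x - 0.5)

errors = {n: sup_error(target, n) for n in (10, 40, 160)}
for n, err in errors.items():
    print(f"degree {n}: estimated sup error = {err:.4f}")
```

The error decreases slowly for this target because of the kink at $1/2$, but it decreases uniformly over the whole interval, which is the point of the theorem.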

The following approximation result holds for continuous functions on
product spaces:

Proposition 7.4.5 Let $f : [a, b] \times [c, d] \to \mathbb{R}$ be a continuous function. Then
for any $\epsilon > 0$, there is $N \ge 1$ and there exist continuous functions $g_i \in C[a, b]$
and $h_i \in C[c, d]$, $i = 1, \dots, N$, such that

$$\max_{x \in [a,b],\, y \in [c,d]} \Big| f(x,y) - \sum_{i=1}^{N} g_i(x) h_i(y) \Big| < \epsilon.$$


Proof: Consider the set of functions

$$A = \Big\{ G(x,y) = \sum_{i=1}^{N} g_i(x) h_i(y);\ g_i \in C[a, b],\ h_i \in C[c, d],\ N = 1, 2, \dots \Big\}.$$

It is easy to see that $A$ is closed with respect to linear combinations and
multiplications, i.e., $A$ is an algebra of $C([a,b] \times [c, d])$. It is obvious that
$A$ contains the constant functions. To show that $A$ separates points, let
$(x_1, y_1) \ne (x_2, y_2)$ be two distinct points. Then, either $x_1 \ne x_2$, or $y_1 \ne y_2$.
In the former case the function $G(x, y) = x \cdot 1$ separates the points, while
in the latter, $G(x, y) = 1 \cdot y$ does the same. Applying the Stone-Weierstrass
theorem yields the desired result. ■
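Proposition 7.4.5 can be illustrated with an explicit separable sum. The construction below is not the one in the proof, which is non-constructive; it is an assumed alternative based on piecewise linear interpolation in the first variable: $g_i$ is the hat function centered at the node $i/n$ and $h_i(y) = f(i/n, y)$. The test function $f(x,y) = e^{-(x-y)^2}$ on $[0,1]^2$ and the grid used to estimate the maximum are also assumptions.

```python
import math

def f(x, y):
    return math.exp(-(x - y) ** 2)

def hat(i, n, x):
    """Piecewise linear hat function centered at the node i/n, supported on two cells."""
    return max(0.0, 1.0 - abs(x * n - i))

def separable_sum(n, x, y):
    """G(x, y) = sum_i g_i(x) h_i(y) with g_i = hat_i and h_i(y) = f(i/n, y)."""
    return sum(hat(i, n, x) * f(i / n, y) for i in range(n + 1))

def sup_error(n, grid_size=41):
    """Grid estimate of max |f - G| over the unit square."""
    pts = [k / (grid_size - 1) for k in range(grid_size)]
    return max(abs(f(x, y) - separable_sum(n, x, y)) for x in pts for y in pts)

err_coarse = sup_error(4)
err_fine = sup_error(16)
print("5 terms: ", err_coarse)
print("17 terms:", err_fine)
```

Each term is a product of a function of $x$ and a function of $y$, so the sum belongs to the algebra $A$ of the proof, and the error shrinks as more terms are used.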











Remark 7.4.6 The previous result can be easily extended to a finite product
of compact intervals.

One of the applications of the Stone-Weierstrass Theorem is that neural
networks with cosine (or sine) activation functions can learn any periodic
function. The next section develops this idea.

7.5 Application to Neural Networks 

Consider a one-hidden layer neural network with a real input, $x$, a one-dimensional
output, $y$, and $N$ neurons in the hidden layer. We assume the
activation function is $\phi(x) = \cos(x)$. Since this is neither a sigmoidal nor a
squashing function, none of the results obtained in Chapter 9 can be applied
to this case. The network's output is given by the sum

$$y = a_0 + \sum_{j=1}^{N} \alpha_j \cos(w_j x + b_j),$$

where $w_j$ and $\alpha_j$ are the weights from the input to the hidden layer and
from the hidden layer to the output, respectively. The biases in the hidden
layer are denoted by $b_j$, while $a_0$ denotes the bias of the neuron in the output
layer. We note that the activation function in the hidden layer is the cosine
function, while the activation function in the output neuron is linear. See
Fig. 7.3 a.

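We claim that the aforementioned network can represent continuous periodic
functions. Consider a continuous function $f : \mathbb{R} \to \mathbb{R}$, which is periodic
with period $T$, that is, $f(x + T) = f(x)$. Let $\nu = \frac{2\pi}{T}$ be the associated frequency.
Consider the weights of the form $w_j = j\nu$. An application of trigonometric
formulas provides

$$\begin{aligned}
y &= a_0 + \sum_{j=1}^{N} \alpha_j \cos(j\nu x + b_j) \\
  &= a_0 + \sum_{j=1}^{N} \big[ \alpha_j \cos(j\nu x) \cos(b_j) - \alpha_j \sin(j\nu x) \sin(b_j) \big] \\
  &= a_0 + \sum_{j=1}^{N} \big[ a_j \cos(j\nu x) - c_j \sin(j\nu x) \big] \\
  &= a_0 + \sum_{j=1}^{N} \Big[ a_j \cos\Big(\frac{2\pi j x}{T}\Big) - c_j \sin\Big(\frac{2\pi j x}{T}\Big) \Big],
\end{aligned}$$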








Figure 7.3: a. One-hidden layer neural network with cosine activation functions.
b. The cost function reaches its relative minima for certain values of the
hyperparameter $\nu$.


with coefficients $a_j = \alpha_j \cos(b_j)$ and $c_j = \alpha_j \sin(b_j)$. By Exercise 7.11.7, these
trigonometric sums are dense in the set of continuous periodic functions on
$\mathbb{R}$ with period $T$. Therefore, $\forall \epsilon > 0$, there are $N \ge 1$ and $\alpha_j, \nu \in \mathbb{R}$ such that

$$\Big| a_0 + \sum_{j=1}^{N} \alpha_j \cos(j\nu x + b_j) - f(x) \Big| < \epsilon.$$

Since cosine is an even function, it suffices to choose $\nu > 0$. Choosing the
decreasing sequence $\epsilon = \frac{1}{n}$ yields a sequence of frequencies $\nu_n$, which approximates
the proper frequency $\nu = \frac{2\pi}{T}$ of the target function $f$.

This method can be slightly modified to extract hidden frequencies from
a continuous signal. Consider the target signal $z = f(x)$, where $x$ denotes
time. The weight $\nu$ from the previous computation is now taken as a hyperparameter.
So, for each fixed value of $\nu$ we tune the other parameters of
the neural network until the cost function is minimum. This can be done,
for instance, by the gradient descent method. Then, varying the value of $\nu$,
we obtain different values of the associated cost function. We select those
values of $\nu$ for which the cost function achieves local minima, see Fig. 7.3 b.
They correspond to the proper frequencies contained in the signal. The global
minimum of the cost function is supposed to be obtained for the value of $\nu$
obtained in the previous tuning, and this corresponds to the main frequency
contained in the signal.
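The frequency scan just described can be sketched on a synthetic signal. Several simplifications here are assumptions and not the method of the text: a single hidden neuron ($N = 1$) without output bias is used, so the model is $a \cos(\nu x) - c \sin(\nu x)$; the inner tuning is done by closed-form least squares instead of gradient descent; and the signal, with hidden frequencies $3$ and $7$, is invented for the test.

```python
import math

def signal(x):
    # Hypothetical test signal with hidden frequencies 3 (dominant) and 7.
    return 1.5 * math.cos(3.0 * x + 0.4) + 0.5 * math.cos(7.0 * x - 1.0)

def cost(nu, xs, zs):
    """Mean squared error of the best fit z ~ a*cos(nu*x) - c*sin(nu*x)."""
    u = [math.cos(nu * x) for x in xs]
    v = [-math.sin(nu * x) for x in xs]
    s_uu = sum(p * p for p in u)
    s_vv = sum(q * q for q in v)
    s_uv = sum(p * q for p, q in zip(u, v))
    r_u = sum(p * z for p, z in zip(u, zs))
    r_v = sum(q * z for q, z in zip(v, zs))
    det = s_uu * s_vv - s_uv * s_uv          # 2x2 normal equations, solved by Cramer's rule
    a = (r_u * s_vv - r_v * s_uv) / det
    c = (r_v * s_uu - r_u * s_uv) / det
    return sum((a * p + c * q - z) ** 2 for p, q, z in zip(u, v, zs)) / len(xs)

xs = [0.05 * i for i in range(400)]          # time samples on [0, 20)
zs = [signal(x) for x in xs]
nus = [0.5 + 0.05 * i for i in range(191)]   # candidate frequencies 0.5, ..., 10
costs = [cost(nu, xs, zs) for nu in nus]

best_nu = nus[costs.index(min(costs))]
local_minima = [nus[i] for i in range(1, len(nus) - 1)
                if costs[i] < costs[i - 1] and costs[i] < costs[i + 1]]
print("global minimum near nu =", round(best_nu, 2))
print("local minima:", [round(nu, 2) for nu in local_minima])
```

Because the observation window is finite, the scan also shows shallow side minima besides the two proper frequencies; in practice one would keep only the pronounced ones.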

This type of analysis can help in the study of stocks. The frequencies
found in this way can be used to trade stocks in a more efficient way.











Remark 7.5.1 The traditional way to find the frequencies contained in
a signal is to use the Fourier transform, $\hat{f}(\nu) = \int e^{-i\nu x} f(x)\, dx$, which provides
the amplitude of the signal in the frequency spectrum. High amplitudes
correspond to proper frequencies of the signal $f(x)$.


7.6 Wiener’s Tauberian Theorems 

Sometimes, it is useful for a network to learn a time-dependent signal which is 
integrable, or square integrable on the entire real line. The following theorems 
can be used to prove the existence of these types of learning. The next two 
results are due to Wiener [128]. 


7.6.1 Learning signals in $L^1(\mathbb{R})$

Theorem 7.6.1 Let $f \in L^1(\mathbb{R})$ and consider the translation functions $f_\theta(x) = f(x + \theta)$.
Then the linear span of the family $\{f_\theta;\ \theta \in \mathbb{R}\}$ is dense in $L^1(\mathbb{R})$
if and only if the Fourier transform of $f$ never takes on the value of zero on
the real domain.


This theorem is used as follows. Consider a neural
network with an activation function given by an integrable function, $f$, with
$\hat{f}(\xi) \ne 0$. Then for any $g \in L^1(\mathbb{R})$ and any $\epsilon > 0$, there is a function

$$G(x) = \sum_{j=1}^{N} \alpha_j f(x + \theta_j), \qquad \alpha_j, \theta_j \in \mathbb{R},\ N = 1, 2, \dots,$$

such that $\|g - G\|_1 = \int_{\mathbb{R}} |g(x) - G(x)|\, dx < \epsilon$. The function $G(x)$ is the output
of a one-hidden layer neural network with the activation function $f$, weights
$\alpha_j$, and biases $\theta_j$.


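We shall emphasize next some activation functions satisfying the above
properties.

(i) Double exponential. Let $f(x) = e^{-\lambda |x|}$ with $\lambda > 0$. Then

$$\int_{\mathbb{R}} |f(x)|\, dx = 2 \int_0^\infty e^{-\lambda x}\, dx = \frac{2}{\lambda} < \infty$$

and hence $f \in L^1(\mathbb{R})$. The Fourier transform is given by

$$\hat{f}(\omega) = \int_{\mathbb{R}} e^{-i\omega x} e^{-\lambda |x|}\, dx = \frac{2\lambda}{\lambda^2 + \omega^2} > 0.$$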









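(ii) Laplace potential. Consider $f(x) = \frac{1}{a^2 + x^2}$ with some $a > 0$, fixed. Since

$$\int_{\mathbb{R}} |f(x)|\, dx = \frac{1}{a} \tan^{-1}\Big(\frac{x}{a}\Big) \Big|_{-\infty}^{\infty} = \frac{\pi}{a} < \infty,$$

then $f \in L^1(\mathbb{R})$. Note that the Fourier transform is

$$\hat{f}(\omega) = \frac{\pi}{a} e^{-a|\omega|} \ne 0.$$

(iii) Gaussian. Let $f(x) = e^{-a x^2}$ with $a > 0$. We have $f \in L^1(\mathbb{R})$, because

$$\int_{\mathbb{R}} e^{-a x^2}\, dx = \sqrt{\frac{\pi}{a}} < \infty.$$

The Fourier transform, $\hat{f}(\omega) = \sqrt{\frac{\pi}{a}}\, e^{-\frac{\omega^2}{4a}}$, never vanishes.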


We conclude by stating that any one-hidden layer neural net with the
input $x \in \mathbb{R}$ and an activation function such as the ones given by (i), (ii),
or (iii) can learn any given integrable function $g \in L^1(\mathbb{R})$, by tuning the
weights $\alpha_j$, biases $\theta_j$, and the number of units, $N$, in the hidden layer.
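As a worked check of case (i), the transform of the double exponential can be computed directly. The derivation assumes the convention $\hat{f}(\omega) = \int_{\mathbb{R}} e^{-i\omega x} f(x)\, dx$, the one written in Remark 7.5.1; other normalizations only rescale the result by a positive constant, so the conclusion that $\hat{f}$ never vanishes is unaffected.

```latex
\begin{aligned}
\hat{f}(\omega)
  &= \int_{-\infty}^{0} e^{-i\omega x} e^{\lambda x}\, dx
   + \int_{0}^{\infty} e^{-i\omega x} e^{-\lambda x}\, dx \\
  &= \int_{0}^{\infty} e^{-(\lambda - i\omega) x}\, dx
   + \int_{0}^{\infty} e^{-(\lambda + i\omega) x}\, dx \\
  &= \frac{1}{\lambda - i\omega} + \frac{1}{\lambda + i\omega}
   = \frac{2\lambda}{\lambda^{2} + \omega^{2}} > 0 .
\end{aligned}
```

Both integrals converge because $\lambda > 0$, and the result is strictly positive for every real $\omega$, so the hypothesis of Theorem 7.6.1 holds for this activation function.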


7.6.2 Learning signals in $L^2(\mathbb{R})$

Theorem 7.6.2 Let $f \in L^2(\mathbb{R})$ and consider the translation functions $f_\theta(x) = f(x + \theta)$.
Then the linear span of the family $\{f_\theta;\ \theta \in \mathbb{R}\}$ is dense in $L^2(\mathbb{R})$ if
and only if the zero set of the Fourier transform of $f$ is Lebesgue negligible.

We use the theorem as follows. Consider a square integrable function,
$f$, with the set $\{\xi;\ \hat{f}(\xi) = 0\}$ Lebesgue negligible. In practice, it suffices
for $\hat{f}(\xi)$ to have a finite or countable number of zeros. Then for any $g \in L^2(\mathbb{R})$
and any $\epsilon > 0$, there is a function

$$G(x) = \sum_{j=1}^{N} \alpha_j f(x + \theta_j), \qquad \alpha_j, \theta_j \in \mathbb{R},\ N = 1, 2, \dots,$$

such that $\|g - G\|_2^2 = \int_{\mathbb{R}} (g(x) - G(x))^2\, dx < \epsilon$. We note that the function $G(x)$
is the output of a one-hidden layer neural net with the activation function $f$,
weights $\alpha_j$, and biases $\theta_j$.

It is worth noting that the previously presented activation functions (i), 
(ii), and (iii) are also square integrable and do satisfy the zero Lebesgue 
measure condition. 





Remark 7.6.3 This method has some limitations.

(i) One of them is that all the previous activation functions are bell-shaped. Activation
functions of sigmoid type or hockey-stick shape, such as the logistic function
or ReLU, are not in $L^1(\mathbb{R})$, and hence, Wiener's theorem is not applicable in
this case.

(ii) Another limitation is the fact that a closed-form expression for the Fourier
transform is known only for a limited number of functions.


7.7 Contraction Principle

This section deals with an application of the contraction principle to deep 
learning. 

If $(S, d)$ is a metric space, a mapping $A : S \to S$ is called a contraction
if there is a $0 < \lambda < 1$ such that

$$d(Ax, Ax') \le \lambda\, d(x, x'), \qquad \forall x, x' \in S.$$


Example 7.7.1 Let $S = \mathbb{R}$ endowed with the metric $d(x,x') = |x - x'|$.
Consider $A : \mathbb{R} \to \mathbb{R}$, given by $Ax = \frac{1}{2} \cos x$. By the mean value theorem,
$\cos x - \cos x' = -\sin(\xi)(x - x')$, with $\xi$ between $x$ and $x'$. Then

$$|Ax - Ax'| = \frac{1}{2} |\cos x - \cos x'| \le \frac{1}{2} |x - x'|, \qquad \forall x, x' \in \mathbb{R},$$

and hence, the function $A$ is a contraction.


A sequence $(x_n)_n$ in a metric space $(S, d)$ is called a Cauchy sequence if
the distance between any two elements of large enough index can be made
arbitrarily small. That is, $\forall \epsilon > 0$, there is $N \ge 1$ such that

$$d(x_n, x_m) < \epsilon, \qquad \forall n, m \ge N.$$

It is easy to notice that any convergent sequence is Cauchy. This is a consequence
of the triangle inequality. If $x_n$ converges to $x$, then for any $\epsilon > 0$
there is $N \ge 1$ such that

$$d(x_n, x_m) \le d(x_n, x) + d(x_m, x) < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon, \qquad \forall n, m \ge N.$$

The converse statement is not necessarily true, so we need the following
concept. A metric space $(S, d)$ is called complete if any Cauchy sequence is
convergent.


Example 7.7.2 The $n$-dimensional real space $\mathbb{R}^n$ is a complete metric space
together with the Euclidean distance.





The element $x_0 \in S$ is called a fixed point for the mapping $A : S \to S$ if
$A x_0 = x_0$. For instance, if $S = \mathbb{R}$, the fixed points correspond to intersections
between the graph of $A$ and the line $y = x$. For several examples of fixed
points the reader is referred to section E.6 in the Appendix.

The following main result will be useful shortly for the study of neural
networks.


Theorem 7.7.1 (The Contraction Principle) Let $(S,d)$ be a complete
metric space. Then any contraction $A : S \to S$ has a unique fixed point.
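The contraction principle can be observed on the map $Ax = \frac{1}{2}\cos x$ of Example 7.7.1 by plain iteration, the usual way the fixed point is reached in practice. The sketch below is an illustration only; the starting points, tolerance, and iteration cap are arbitrary assumptions, and `fixed_point` is a hypothetical helper name.

```python
import math

def A(x):
    return 0.5 * math.cos(x)

def fixed_point(start, tol=1e-14, max_iter=200):
    """Iterate x_{n+1} = A(x_n) until successive terms differ by less than tol."""
    x = start
    for _ in range(max_iter):
        x_next = A(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

limits = [fixed_point(s) for s in (-100.0, 0.0, 3.0, 1e6)]
print(limits)
```

All starting points lead to the same limit, which illustrates both the existence and the uniqueness asserted by the theorem.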

The output of a one-hidden layer neural network, having the logistic activation
function $\sigma$ for all hidden neurons, is given by

$$A(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j x + b_j), \tag{7.7.1}$$

where we assumed that both the input and output are real numbers, i.e.,
$A : \mathbb{R} \to \mathbb{R}$.

Proposition 7.7.2 Assume the following weights regularization conditions
hold:

$$\sum_{j=1}^{N} \alpha_j^2 \le 1, \qquad \sum_{j=1}^{N} w_j^2 \le 1. \tag{7.7.2}$$

Then the input-output function $A$, defined by (7.7.1), is a contraction with
$\lambda = 1/4$.


Proof: By the mean value theorem, the logistic function satisfies

$$|\sigma(u) - \sigma(u')| \le \max |\sigma'|\, |u - u'| \le \frac{1}{4} |u - u'|, \qquad \forall u, u' \in \mathbb{R}.$$

Using the previous relation, Cauchy's inequality, and the regularization conditions,
we have the estimate

$$\begin{aligned}
|A(x) - A(x')| &= \Big| \sum_{j=1}^{N} \alpha_j \big( \sigma(w_j x + b_j) - \sigma(w_j x' + b_j) \big) \Big| \\
&\le \sum_{j=1}^{N} |\alpha_j| \, \big| \sigma(w_j x + b_j) - \sigma(w_j x' + b_j) \big| \\
&\le \frac{1}{4} \sum_{j=1}^{N} |\alpha_j| \, |w_j| \, |x - x'| \\
&\le \frac{1}{4} |x - x'| \Big( \sum_{j=1}^{N} \alpha_j^2 \Big)^{1/2} \Big( \sum_{j=1}^{N} w_j^2 \Big)^{1/2} \le \frac{1}{4} |x - x'|,
\end{aligned}$$

which shows that $A$ is a contraction. ■

Figure 7.4: A three-layer neural unit whose output is a contraction.

Remark 7.7.3 In the case of an activation function with bounded slope, the 
regularization conditions can be weakened to 

N N 

3 = 1 i =1 

with $M < \sup |\sigma'|$. Namely, the norm of the weights is bounded above by the
maximum slope of the activation function.

Corollary 7.7.4 There is a unique real value, $c \in \mathbb{R}$, which is invariant
through the network, that is, $Ac = c$.

Proof: From Proposition 7.7.2 the function $A$ is a contraction, and since the
space $\mathbb{R}$ is complete, the Contraction Principle, Theorem 7.7.1, implies that
$A$ has a unique fixed point, $c$. ■

Remark 7.7.5 The unique value $c$ is related to the saturation level of a
recurrent neural network, as we shall see in the next section.


7.8 Application to Recurrent Neural Nets 

In the following, a neural unit will stand for a one-hidden layer neural net
for which the regularization conditions (7.7.2) hold, and which has an input $x \in \mathbb{R}$ and
an output $A(x) \in \mathbb{R}$ given by (7.7.1). A neural unit is depicted in Fig. 7.4.

We now concatenate neural units to construct a recurrent neural network
as in Fig. 7.5. This means that the output of the $n$th neural unit is
the input of the $(n + 1)$th neural unit, for $n \ge 1$. The output of the $n$th
unit is $A^n(x) = A(\dots A(A(x)) \dots)$. For the time being, all neural units are
considered identical from the weights and biases point of view. There is no
update in the weights, which are considered constants. The learning in the
network is unsupervised, i.e., there is no a priori specified target function.

Figure 7.5: A recurrent neural net constructed from three-layer neural units.

Now, the question is: What does this type of network learn?

To be more precise, assume that for each input value, $x \in \mathbb{R}$, the $n$th
outcome, $y_n = A^n(x)$, is a convergent sequence with the limit $y$, depending
on $x$. If we define $y = \Lambda(x)$, the function learned by the network is $\Lambda(x)$.
The next result deals with the existence, uniqueness, and the form of this
function.

Proposition 7.8.1 The recurrent network constructed in Fig. 7.5 learns a
constant $c$, which is the unique fixed point of the input-output mapping $y = A(x)$.

Proof: For each input value $x$, consider the sequence of real numbers $y_n = A^n(x)$.
In order to prove that $(y_n)$ is convergent, it suffices to show that it is
Cauchy.

Using the contraction property of the function $A$, we estimate the difference
between two consecutive terms in the sequence as

$$|y_{n+1} - y_n| = |A^{n+1}(x) - A^n(x)| \le \frac{1}{4} |A^n(x) - A^{n-1}(x)| \le \cdots \le \frac{1}{4^n} |A(x) - x| = \frac{1}{4^n} |y_1 - x|.$$
Let $m = n + k$ and then estimate the difference between two distant terms
in the sequence using the triangle inequality as

$$\begin{aligned}
|y_{n+k} - y_n| &\le |y_{n+k} - y_{n+k-1}| + |y_{n+k-1} - y_{n+k-2}| + \cdots + |y_{n+1} - y_n| \\
&\le \frac{1}{4^{n+k-1}} |y_1 - x| + \frac{1}{4^{n+k-2}} |y_1 - x| + \cdots + \frac{1}{4^n} |y_1 - x| \\
&\le \frac{1}{4^n} \Big( 1 + \frac{1}{4} + \frac{1}{4^2} + \cdots \Big) |y_1 - x| \\
&= \frac{1}{4^n} \cdot \frac{1}{1 - \frac{1}{4}} |y_1 - x| = \frac{1}{3 \cdot 4^{n-1}} |y_1 - x|.
\end{aligned} \tag{7.8.3}$$

It is clear that the difference $|y_{n+k} - y_n|$ can be made arbitrarily small, if $n$
is large enough. Hence, $(y_n)_n$ is a Cauchy sequence. Since $\mathbb{R}$ is complete, the
sequence $(y_n)_n$ is convergent, with the limit $y = \lim_{n \to \infty} y_n$. Since $y$ depends
on $x$, we write $y = \Lambda(x)$.

In the following we shall prove that in fact $y$ does not depend on the input
$x$, and hence, $\Lambda(x)$ is constant. In order to show this, it suffices to consider
two arbitrary fixed inputs $x, x' \in \mathbb{R}$ and show that for any $\epsilon > 0$ we have
$|\Lambda(x) - \Lambda(x')| < \epsilon$.

Let $y'_n = A^n(x')$, and let $y'$ be its limit. The triangle inequality provides

$$\begin{aligned}
|\Lambda(x) - \Lambda(x')| &\le |\Lambda(x) - A^n(x)| + |A^n(x) - A^n(x')| + |A^n(x') - \Lambda(x')| \\
&\le |y - y_n| + \frac{1}{4^n} |x - x'| + |y'_n - y'| \\
&< \frac{\epsilon}{3} + \frac{\epsilon}{3} + \frac{\epsilon}{3} = \epsilon,
\end{aligned}$$

since for $n$ large enough, each of the terms can be made smaller than $\epsilon/3$.
Hence, $y = \Lambda(x)$ is a constant. In order to determine its value, we estimate
by the triangle inequality

$$\begin{aligned}
|A(y) - y| &\le |A(y) - y_n| + |y_n - y| = |A(y) - A(y_{n-1})| + |y_n - y| \\
&\le \frac{1}{4} |y - y_{n-1}| + \frac{\epsilon}{2} < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon,
\end{aligned}$$

for $n$ large enough and any a priori fixed $\epsilon > 0$. It follows that $A(y) = y = c$,
i.e., $y$ is the unique fixed point of the function $A$, that is, $Ac = c$. In conclusion,
the network learns the fixed point of the input-output mapping $A$. ■


Remark 7.8.2 It is useful to measure the distance from an approximation
point $y_n$ to the fixed point $c$. For this, we take the limit $k \to \infty$ in
relation (7.8.3) and obtain

\[
|y_n - c| \le \frac{1}{3} \cdot \frac{1}{4^{n-1}}\,|y_1 - x| = \frac{1}{3 \cdot 4^{n-1}}\,|A(x) - x|.
\tag{7.8.4}
\]























































Figure 7.6: A recurrent neural net constructed from two $\epsilon$-close three-layer
neural units.


The integer n, which denotes the depth of the network, can be chosen such 
that the error becomes arbitrarily small. 
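The convergence stated in Proposition 7.8.1 and the bound (7.8.4) can be checked numerically. The sketch below is only an illustration: the weights `ALPHAS`, `WEIGHTS`, `BIASES` are hypothetical values, chosen so that $\sum_j |\alpha_j w_j| = 1$, which, together with $|\sigma'| \le 1/4$ for the logistic function, makes the unit a contraction with constant at most $1/4$.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic function; its derivative is bounded by 1/4."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical unit A(x) = sum_j alpha_j * sigmoid(w_j * x + b_j).
# Since sum_j |alpha_j * w_j| = 1, we get |A'(x)| <= 1/4 (a contraction).
ALPHAS = [0.5, -0.3, 0.2]
WEIGHTS = [1.0, 1.0, -1.0]
BIASES = [0.1, -0.4, 0.7]

def unit(x: float) -> float:
    return sum(a * sigmoid(w * x + b) for a, w, b in zip(ALPHAS, WEIGHTS, BIASES))

def iterate(x: float, n: int) -> list:
    """Return [y_0, y_1, ..., y_n] with y_0 = x and y_k = A(y_{k-1})."""
    ys = [x]
    for _ in range(n):
        ys.append(unit(ys[-1]))
    return ys

def error_bound(x: float, n: int) -> float:
    """Right-hand side of (7.8.4): |A(x) - x| / (3 * 4**(n - 1))."""
    return abs(unit(x) - x) / (3.0 * 4.0 ** (n - 1))

ys = iterate(2.0, 60)
c = ys[-1]  # numerical stand-in for the fixed point c
for n in (1, 2, 5, 10):
    # Observed distance to c next to the theoretical bound.
    print(n, abs(ys[n] - c), error_bound(2.0, n))
```

Starting the iteration from a different input, for instance `iterate(-5.0, 60)`, should end at the same value `c`, in agreement with the claim that the learned function is constant.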

Changing the weights of the network leads to a different learning constant
$c$. We shall study next the continuous dependence of the fixed point $c$ with
respect to the weights $\alpha_j$. First, we need to introduce a new concept.


Definition 7.8.3 Two neural nets are called $\epsilon$-close if for some $\epsilon > 0$
their input-output mappings satisfy $|A(x) - A'(x)| < \epsilon$, for all inputs
$x \in \mathbb{R}$.


Equivalently, this means that the output of one network belongs to an
$\epsilon$-neighborhood of the output produced by the other network, i.e.,

\[ A'(x) \in (A(x) - \epsilon,\ A(x) + \epsilon). \]

Roughly stated, the error between the outputs of the networks is uniform
with respect to the input $x$.


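Lemma 7.8.4 Consider two neural units with input-output mappings given
by $A(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j x + b_j)$ and
$A'(x) = \sum_{j=1}^{N} \alpha'_j \sigma(w_j x + b_j)$, such that
$\sum_{j=1}^{N} |\alpha'_j - \alpha_j| < \epsilon$. Then the neural units are
$\epsilon$-close.


Proof: Using that $|\sigma(x)| < 1$, a straightforward computation provides

\[
|A'(x) - A(x)| = \Big| \sum_{j=1}^{N} (\alpha'_j - \alpha_j)\,\sigma(w_j x + b_j) \Big|
\le \sum_{j=1}^{N} |\alpha'_j - \alpha_j| < \epsilon,
\]

for all $x \in \mathbb{R}$. ■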


Lemma 7.8.5 Consider two $\epsilon$-close neural units as in Lemma 7.8.4, with
input-output mappings $A(x)$ and $A'(x)$, and denote by $c$ and $c'$ their fixed
points. Then $|c' - c| < \frac{4\epsilon}{3}$.































Figure 7.7: A resonance network. 


Proof: Construct the sequence $y_n = (A')^n(c)$, where $A(c) = c$. The sequence
$y_n$ converges to $c'$, the fixed point of $A'$. If in relation (7.8.3) we take
$k \to \infty$ and let $n = 0$, then we obtain a bound for the distance between
the initial point and the limit

\[ \lim_{n \to \infty} |y_n - y_0| \le \frac{4}{3}\,|y_1 - y_0|, \]

or, equivalently,

\[ |c' - c| \le \frac{4}{3}\,|A'(c) - c| = \frac{4}{3}\,|A'(c) - A(c)| < \frac{4\epsilon}{3}, \]

where we used that $A$ and $A'$ are $\epsilon$-close. ■

The continuous dependence of the fixed point of a neural unit with respect
to the weights $\alpha_j$ can be formulated as follows:


Proposition 7.8.6 Consider two neural units as in Lemma 7.8.4. Then for
any $\epsilon > 0$, there is an $\eta > 0$ such that if
$\sum_{j=1}^{N} |\alpha'_j - \alpha_j| < \eta$, then $|c' - c| < \epsilon$.


Proof: Let $\epsilon > 0$ be arbitrarily fixed and choose $\eta = \frac{3\epsilon}{4}$.
By Lemma 7.8.4 the networks are $\eta$-close. Then applying Lemma 7.8.5 leads to
the desired result. ■
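A small numerical experiment can illustrate Lemmas 7.8.4 and 7.8.5. In the sketch below all weights are hypothetical and are kept small enough that both units remain contractions with constant at most $1/4$ (an assumption inherited from the setting of Proposition 7.8.1); the helper names `make_unit` and `fixed_point` are introduced only for this illustration.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def make_unit(alphas, weights, biases):
    """Build A(x) = sum_j alpha_j * sigmoid(w_j * x + b_j)."""
    def unit(x: float) -> float:
        return sum(a * sigmoid(w * x + b) for a, w, b in zip(alphas, weights, biases))
    return unit

def fixed_point(unit, x0: float = 0.0, steps: int = 80) -> float:
    """Approximate the fixed point by plain iteration of the contraction."""
    x = x0
    for _ in range(steps):
        x = unit(x)
    return x

# Shared hidden weights and biases; only the output weights differ.
weights = [1.0, 1.0, -1.0]
biases = [0.1, -0.4, 0.7]
alphas = [0.4, -0.3, 0.2]
alphas_perturbed = [0.404, -0.303, 0.202]
eps = 0.01  # the two units are eps-close because gap < eps

unit_a = make_unit(alphas, weights, biases)
unit_b = make_unit(alphas_perturbed, weights, biases)
gap = sum(abs(p - a) for p, a in zip(alphas_perturbed, alphas))

c = fixed_point(unit_a)
c_prime = fixed_point(unit_b)
# Lemma 7.8.5 predicts |c' - c| < 4 * eps / 3.
print(gap, abs(c_prime - c), 4.0 * eps / 3.0)
```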


A recurrent neural network obtained by using two $\epsilon$-close neural units,
which share the weights $w_j$ and biases $b_j$, is given in Fig. 7.6. The network
learns the vector $(c, c')$ of fixed points.



















7.9 Resonance Networks 


We consider a neural network with two layers, see Fig. 7.7. The information
goes back and forth between the layers until it eventually converges to a
steady state. Let $N$ be the number of neurons in each layer. The initial state
of the layer on the left is given by the vector $X_0$. This induces the state on
the second layer $Y_0 = \phi(W X_0 + b)$, where the weights are assumed symmetric,
$w_{ij} = w_{ji}$. This information is fed back into the first layer, which
provides the output $X_1 = \phi(W Y_0 + b)$. Here, $\phi$ is the activation function
for both layers and $b$ is the common bias vector.

This way, we obtain the sequences $(X_n)_n$ and $(Y_n)_n$ defined recursively
by the system

\[
\begin{aligned}
Y_n &= \phi(W X_n + b) \\
X_n &= \phi(W Y_{n-1} + b), \quad n \ge 1.
\end{aligned}
\]

This can be separated into two recursions as

\[
\begin{aligned}
Y_n &= F(Y_{n-1}) \\
X_n &= F(X_{n-1}),
\end{aligned}
\]

with $F : \mathbb{R}^N \to \mathbb{R}^N$, $F(U) = \phi(W \phi(W U + b) + b)$. The steady
state of the network corresponds to the pair of fixed points $(X^*, Y^*)$, where
$F(X^*) = X^*$ and $F(Y^*) = Y^*$.

We shall treat this problem using the Contraction Principle. First, we
write the problem in an equivalent way as follows. Denote $Z_0 = X_0$,
$Z_1 = Y_0$, $Z_2 = X_1$, and in general $Z_{n+1} = G(Z_n)$, where
$G(Z) = \phi(W Z + b)$. It suffices to show that $Z_n$ is a sequence of vectors
converging to some vector $Z^*$. Since $G : \mathbb{R}^N \to \mathbb{R}^N$, a
computation shows

\[
\| G(Z') - G(Z) \|_2 \le \| \nabla G \| \, \| Z' - Z \|_2
\le \sqrt{N}\, \| \phi' \|_\infty \| W \| \, \| Z' - Z \|_2, \quad \forall Z, Z' \in \mathbb{R}^N.
\]

If the weights satisfy the inequality

\[ \| W \| < \frac{1}{\sqrt{N}\, \| \phi' \|_\infty}, \]

then $G$ becomes a contraction mapping. Theorem E.6.5 assures that $G$ has a
unique fixed point, $Z^*$, which is approached by the sequence $Z_n$.

The conclusion is that if the weights are small enough, the state of 
the network tends to a stable state or resonant state. 
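The back-and-forth dynamics can be simulated directly. The following sketch is an illustration under stated assumptions: the symmetric matrix `W` and the bias `b` are hypothetical, the activation is the logistic function (so $\|\phi'\|_\infty = 1/4$), and the Frobenius norm is used for $\|W\|$, which is one possible reading of the norm in the stability condition.

```python
import math

def logistic(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical symmetric weights (w_ij = w_ji) and common bias, N = 3.
W = [[0.5, -0.3, 0.2],
     [-0.3, 0.4, 0.1],
     [0.2, 0.1, -0.6]]
b = [0.1, -0.2, 0.3]
N = len(b)

def G(z: list) -> list:
    """One pass between the layers: G(Z) = phi(W Z + b)."""
    return [logistic(sum(W[i][j] * z[j] for j in range(N)) + b[i]) for i in range(N)]

def frobenius_norm(M: list) -> float:
    return math.sqrt(sum(v * v for row in M for v in row))

def run(z0: list, steps: int) -> list:
    """Iterate Z_{n+1} = G(Z_n); even indices are X states, odd ones Y states."""
    z = list(z0)
    for _ in range(steps):
        z = G(z)
    return z

# Stability bound 1 / (sqrt(N) * sup|phi'|) with sup|phi'| = 1/4.
stability_bound = 1.0 / (math.sqrt(N) * 0.25)
print(frobenius_norm(W), stability_bound)

z_star = run([0.0, 0.0, 0.0], 200)
z_other = run([5.0, -3.0, 1.0], 200)
print(z_star, z_other)  # both initial states should reach the same resonant state
```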


Remark 7.9.1 If $N = 1$ there are only two neurons, which alternately change
their states. If the activation function is the logistic function, the
stability condition becomes $w_{12} = w_{21} < 4$.













Remark 7.9.2 An example of this type of architecture is Kosko's bidirectional
associative memory (BAM) [67]. It can also be regarded as a special
case of Hopfield (1984) [58], and it is worth noting the relation with the
Cohen-Grossberg model [25].

7.10 Summary 

This chapter contains some classical real analysis results useful for neural
networks' representations.

For learning continuous functions $f \in C[a, b]$, uniform convergence is
required. The neural network produces a sequence of continuous functions,
which converges pointwise to the target function $f$. In order to show that
this convergence is uniform, we can use Dini's theorem.

Another technique for constructing a uniformly convergent sequence of
functions on a compact set is by using the Arzela-Ascoli theorem. In this case
one can extract a convergent subsequence from a set of outcomes which is
uniformly bounded and equicontinuous. Consequently, one-hidden layer neural
networks with sigmoid activation function can learn a continuous function
on a compact set. Another approximation method for continuous functions on
a compact set uses the Stone-Weierstrass theorem. This can also be used
in neural networks to learn periodic functions.

Target functions from $L^1$ and $L^2$ can be learned using Wiener's Tauberian
theorems. They apply mainly to bell-shaped activation functions.

The contraction principle, which states the existence of a fixed point, can
be applied to input-output mappings of a one-hidden layer neural network as
well as to resonance networks. Iterating the input-output mapping, we obtain
that the network learns a certain function.


7.11 Exercises 


Exercise 7.11.1 Let $\mathcal{F} = \{\tanh(ax + b);\ a, b \in \mathbb{R}\}$. Show that
the family $\mathcal{F}$ is uniformly bounded.

Exercise 7.11.2 Let $\mathcal{F} = \{ax + b;\ |a| + |b| < 1,\ x \in [0, 1]\}$ be the
family of affine functions on $[0,1]$ under a regularization constraint.

(a) Show that the family $\mathcal{F}$ is equicontinuous and uniformly bounded;

(b) What can be the application of part (a) for the Adaline neuron?


Exercise 7.11.3 Let $M > 0$ and consider the family of differentiable functions

\[ \mathcal{F} = \Big\{ f : [a, b] \to \mathbb{R};\ \int_a^b f'(x)^2 \, dx < M \Big\}. \]

Show that $\mathcal{F}$ is equicontinuous.

Exercise 7.11.4 Denote by $D$ a dense subset in $(a, b)$ (such as the rational
numbers between $a$ and $b$). Let $\mathcal{F}$ be a family of functions such that

(i) it is equicontinuous on the interval $(a, b)$;

(ii) for any $x_0 \in D$, the family $\mathcal{F}$ is uniformly bounded at $x_0$, i.e.,
there is $M > 0$ such that $|f(x_0)| < M$, $\forall f \in \mathcal{F}$.

Show that $\mathcal{F}$ is uniformly bounded at all points.


Exercise 7.11.5 Let $\mathcal{F}$ be equicontinuous and let $k \ge 1$ be a fixed integer
and $\{w_1, \dots, w_k\}$ a set of weights with $|w_j| < 1$. Show that the set remains
equicontinuous if it is extended by including all linear combinations of the
form $\sum_{j=1}^{k} w_j f_j$.

Exercise 7.11.6 A neural network using sigmoid neurons is designed to
learn a continuous function $f \in C[a, b]$ by the gradient descent method.
The network outcome obtained after the $n$th parameter update is denoted
by $G_n(x) = G(x; w(n), b(n))$. Assume that each time the approximation is
improved, i.e., $|f(x) - G_{n+1}(x)| < |f(x) - G_n(x)|$, for all $x \in [a, b]$ and any
$n \ge 1$. Prove that $G_n$ converges uniformly to $f(x)$ on $[a, b]$ (that is, the
network learns any continuous function on $[a, b]$).


Exercise 7.11.7 Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous periodic function with
period $T$, that is, $f(x + T) = f(x)$ for all $x \in \mathbb{R}$. Show that for any given
$\epsilon > 0$, there is a function

\[
F(x) = a_0 + \sum_{n=1}^{N} \Big( a_n \cos \frac{2\pi n x}{T} + b_n \sin \frac{2\pi n x}{T} \Big)
\]

such that $|F(x) - f(x)| < \epsilon$ for all $x \in \mathbb{R}$.


We make the following remark. A function $f : \mathbb{R} \to \mathbb{R}$ that is a uniform
limit of trigonometric polynomials on $\mathbb{R}$ is called almost periodic. The
previous exercise states that any continuous periodic function on $\mathbb{R}$ is
almost periodic. It is worth noting that $f : \mathbb{R} \to \mathbb{R}$ is almost periodic if
and only if the following Bohr condition holds: for any $\epsilon > 0$, there is
$p > 0$ such that for any real interval $I$ with $|I| > p$, there is $c \in I$ such
that $|f(x) - f(x + c)| < \epsilon$, for all $x \in \mathbb{R}$, see [17].


Exercise 7.11.8 Proposition 7.7.2 has been proved under the hypothesis
that $\sigma$ is the logistic sigmoid. Does the result hold for other sigmoid
activation functions?












Exercise 7.11.9 Let $\{f_j(x)\}_{j \ge 1}$ be a set of equicontinuous and uniformly
bounded functions on $[a, b]$. Show that if
$\lim_{j \to \infty} \int_a^b |f_j(x)| \, dx = 0$, then $\lim_{j \to \infty} f_j = 0$,
uniformly.


Exercise 7.11.10 (The set of integral transforms of $L^2$-functions) Let
$M > 0$ be a fixed constant and $K : [a, b] \times [c, d] \to \mathbb{R}$ a continuous
function. Let $\mathcal{F}_M$ be the set of functions on $[a, b]$ given by

\[ g(s) = \int_c^d K(s, t)\, h(t) \, dt, \]

with $h$ satisfying $\int_c^d h(t)^2 \, dt < M$. Show that

(a) $\mathcal{F}_M$ is equicontinuous;

(b) $\mathcal{F}_M$ is uniformly bounded.


Exercise 7.11.11 Let $a, b \in \mathbb{R}$, with $a < b$.

(a) Show that ^4 = cq G i?, G is a subalgebra of C[a, b]. 

(b) Show that for any $\epsilon > 0$, there are $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ and
nonnegative integers $m_1, \dots, m_n \ge 0$ such that

n 

f(x) — < e, Vx G [a, 6]. 

i=l 

(c) Formulate a machine learning interpretation of this result.




Chapter 8 

Learning with 
One-dimensional Inputs 


This chapter deals with the case of a neural network whose input is bounded
and one-dimensional, $x \in [0,1]$. Besides its simplicity, this case is important
from a few points of view: it can be treated in an elementary way, without the
arsenal of functional analysis (as we shall do in Chapter 9) and, due to its
constructive nature, it provides an explicit algorithm for finding the network
weights.

Both cases of perceptron and sigmoid neural networks with one hidden
layer will be addressed. Learning with ReLU and Softplus is studied in detail.


8.1 Preliminary Results 


In this section we shall use the notions of derivative in the generalized sense,
measure, integration, and Dirac's measure. The reader can find these notions
in sections C.3, C.4, F.2, and F.3 of the Appendix.


1. We shall first show that any simple function can be written as a linear
combination of translations of Heaviside step functions. Assume the support
of the simple function is in $[0,1]$. Then the function can be written as
$c(x) = \sum_{i=0}^{N-1} \alpha_i 1_{[x_i, x_{i+1})}(x)$, where
$0 = x_0 < x_1 < \cdots < x_N = 1$ is a partition of the interval $[0, 1]$ and
$1_{[x_i, x_{i+1})}$ is the indicator function of the interval $[x_i, x_{i+1})$,
namely

\[
1_{[x_i, x_{i+1})}(x) =
\begin{cases}
1, & \text{if } x_i \le x < x_{i+1} \\
0, & \text{otherwise.}
\end{cases}
\]


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_8 




Figure 8.1: The indicator function as a difference of two Heaviside functions,
$1_{[x_i, x_{i+1})}(x) = H(x - x_i) - H(x - x_{i+1})$.


Since any indicator function of an interval is a difference of two step functions

\[ 1_{[x_i, x_{i+1})}(x) = H(x - x_i) - H(x - x_{i+1}), \quad \forall\, 0 \le i < N, \]

see Fig. 8.1, it follows that there are real numbers $c_0, \dots, c_N$ such that

\[
\sum_{i=0}^{N-1} \alpha_i 1_{[x_i, x_{i+1})}(x) = \sum_{i=0}^{N} c_i H(x - x_i),
\tag{8.1.1}
\]

and hence a simple function can be written as a linear combination of
Heaviside step functions. It is obvious now that the interval $[0,1]$ can be
replaced by any compact interval.
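Relation (8.1.1) can be made concrete by computing the coefficients $c_i$ explicitly. The sketch below assumes the right-continuous convention $H(0) = 1$, which matches the half-open intervals $[x_i, x_{i+1})$, and uses the telescoping choice $c_0 = \alpha_0$, $c_i = \alpha_i - \alpha_{i-1}$ for $1 \le i \le N-1$, and $c_N = -\alpha_{N-1}$; the sample partition and values are hypothetical.

```python
def heaviside(t: float) -> float:
    """Heaviside step function with the convention H(0) = 1."""
    return 1.0 if t >= 0 else 0.0

def simple_function(x: float, knots: list, alphas: list) -> float:
    """Evaluate sum_i alpha_i * 1_[x_i, x_{i+1})(x)."""
    for i, a in enumerate(alphas):
        if knots[i] <= x < knots[i + 1]:
            return a
    return 0.0

def heaviside_coefficients(alphas: list) -> list:
    """c_0 = alpha_0, c_i = alpha_i - alpha_{i-1}, c_N = -alpha_{N-1}."""
    cs = [alphas[0]]
    cs += [alphas[i] - alphas[i - 1] for i in range(1, len(alphas))]
    cs.append(-alphas[-1])
    return cs

def heaviside_combination(x: float, knots: list, cs: list) -> float:
    """Evaluate sum_i c_i * H(x - x_i), the right side of (8.1.1)."""
    return sum(c * heaviside(x - k) for c, k in zip(cs, knots))

# Hypothetical partition of [0, 1] with N = 4 and step heights alpha_i.
knots = [0.0, 0.25, 0.5, 0.75, 1.0]
alphas = [2.0, -1.0, 0.5, 3.0]
cs = heaviside_coefficients(alphas)
print(cs)
print(simple_function(0.6, knots, alphas), heaviside_combination(0.6, knots, cs))
```

The last coefficient $c_N$ switches the function off at the right end of the support, which is why the sum on the right of (8.1.1) runs up to $N$ rather than $N-1$.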

2. As a consequence of the previous representation, it follows that the
derivative of any simple function is a linear combination of Dirac measures

\[
c'(x) = \frac{d}{dx} \sum_{i=0}^{N} c_i H(x - x_i) = \sum_{i=0}^{N} c_i \delta(x - x_i)
= \sum_{i=0}^{N} c_i \delta_{x_i}(x),
\]

where the relation $\frac{d}{dx} H(x - x_i) = \delta(x - x_i)$ holds in the
generalized sense, see section F.2 of the Appendix. In other words, the
sensitivity of the output with respect to the input, $c'(x)$, is given as the
superposition of $N$ shocks, $\delta(x - x_i)$, of magnitude $c_i$.

3. In this section it is shown that Dirac's measure can be approximated
weakly by certain probability measures with "bell-shaped" densities. More
precisely, we have:


Proposition 8.1.1 Consider an activation function $\varphi : \mathbb{R} \to \mathbb{R}$
satisfying the following properties:

(i) $\varphi$ is increasing;

(ii) $\varphi(\infty) - \varphi(-\infty) = 1$;

(iii) $\varphi$ is differentiable with $|\varphi'(x)|$ bounded.

Let $\varphi_\epsilon(x) = \varphi\big(\frac{x}{\epsilon}\big)$, and let $\mu_\epsilon$ be the
measure with density $\varphi'_\epsilon$, namely
$d\mu_\epsilon(x) = \varphi'_\epsilon(x)\,dx$. Then $\mu_\epsilon \to \delta$, as
$\epsilon \searrow 0$, in the weak sense.

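Proof: For any smooth function with compact support, $g \in C_0^\infty(\mathbb{R})$,
we have by a change of variable

\[
\int_{-\infty}^{\infty} \varphi'_\epsilon(x) g(x)\,dx
= \int_{-\infty}^{\infty} \frac{1}{\epsilon}\, \varphi'\Big(\frac{x}{\epsilon}\Big) g(x)\,dx
= \int_{-\infty}^{\infty} \varphi'(y)\, g(\epsilon y)\,dy.
\]

The Bounded Convergence Theorem (Theorem C.4.3 in the Appendix) yields

\[
\lim_{\epsilon \searrow 0} \int_{-\infty}^{\infty} \varphi'(y)\, g(\epsilon y)\,dy
= g(0) \int_{-\infty}^{\infty} \varphi'(y)\,dy
= g(0)\big(\varphi(\infty) - \varphi(-\infty)\big) = g(0).
\]

Hence

\[
\lim_{\epsilon \searrow 0} \int_{-\infty}^{\infty} g(x)\,d\mu_\epsilon(x)
= \lim_{\epsilon \searrow 0} \int_{-\infty}^{\infty} \varphi'_\epsilon(x) g(x)\,dx
= g(0) = \int_{-\infty}^{\infty} g(x)\delta(x)\,dx,
\]

for all $g \in C_0^\infty(\mathbb{R})$, which means that $\mu_\epsilon \to \delta$, as
$\epsilon \searrow 0$, in the weak sense. ■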

The function $\varphi'_\epsilon(x)$ has the shape of a "bump function", which tends
to a "spike" as $\epsilon$ decreases to zero. An example of this type of activation
function is the logistic function. In this case
$\varphi_\epsilon(x) = \dfrac{1}{1 + e^{-x/\epsilon}}$.


Remark 8.1.2 Since $\varphi$ is increasing, $\varphi'_\epsilon \ge 0$. We also have

\[
\int_{\mathbb{R}} d\mu_\epsilon(x) = \int_{\mathbb{R}} \varphi'_\epsilon(x)\,dx
= \varphi_\epsilon(\infty) - \varphi_\epsilon(-\infty) = \varphi(\infty) - \varphi(-\infty) = 1.
\]

This implies that $\mu_\epsilon$ are probability measures on
$(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
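The weak convergence $\mu_\epsilon \to \delta$ can be observed numerically for the logistic function. The sketch below evaluates the integral in the rescaled form $\int \varphi'(y)\, g(\epsilon y)\,dy$ used in the proof, with a trapezoidal rule on a truncated range. The test function `bump` (a standard compactly supported smooth function), the truncation `half_width`, and the number of steps are choices made only for this illustration.

```python
import math

def logistic_prime(y: float) -> float:
    """Derivative of the logistic function: s(y) * (1 - s(y))."""
    s = 1.0 / (1.0 + math.exp(-y))
    return s * (1.0 - s)

def bump(x: float) -> float:
    """Smooth test function supported in (-1, 1), with bump(0) = exp(-1)."""
    if abs(x) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - x * x))

def smoothed_value(g, eps: float, half_width: float = 40.0, steps: int = 8000) -> float:
    """Trapezoidal approximation of the integral of phi'(y) * g(eps * y) dy."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        y = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * logistic_prime(y) * g(eps * y)
    return total * h

for eps in (1.0, 0.1, 0.01):
    # The integral should approach g(0) as eps decreases to zero.
    print(eps, smoothed_value(bump, eps), bump(0.0))
```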


4. The convolution with the Dirac measure centered at $a$, namely
$\delta_a(x) = \delta(x - a)$, is the same as a composition with a translation of
size $a$:

\[ (\delta_a * f)(x) = \int_{-\infty}^{\infty} \delta_a(y) f(x - y)\,dy = f(x - a). \]

This can be stated equivalently by saying that the signal $f(x)$ filtered
through the Dirac measure $\delta_a$ is the same signal to which a phase shift
has been applied.








Figure 8.2: A one-hidden layer perceptron network with input to hidden layer
weights equal to 1 and hidden layer to output weights equal to $c_i$; the biases
are given by $x_i$.


5. Any continuous real-valued function defined on a compact interval,
$g : [a, b] \to \mathbb{R}$, is uniformly continuous, i.e., $\forall \epsilon > 0$,
$\exists \delta > 0$ such that $\forall x, x' \in [a, b]$ with $|x - x'| < \delta$, we
have $|g(x) - g(x')| < \epsilon$. The proof of this classical statement is
left to the reader in Exercise 8.8.3.

To understand this statement heuristically, we think of $g(x)$ as the
speaker's volume of a radio set whose sliding button is situated in position
$x$. This position can be anywhere in the interval $[a, b]$, where $a$ stands
for the minimum and $b$ for the maximum possible positions. Then, for any
$\epsilon$-adjustment of the volume, there is a $\delta$-control of the sliding
button, such that moving from any position $x$ into a $\delta$-neighborhood
position $x'$, the corresponding volume has a net change less than $\epsilon$.


8.2 One-Hidden Layer Perceptron Network 

Consider a one-hidden layer neural network with the activation functions
in the hidden layer given by the Heaviside function and a linear activation
function in the output layer. We shall show that this network can potentially
learn any continuous function $g \in C[0,1]$. Equivalently stated, any continuous
function on a compact interval can be approximated uniformly by simple
functions. The network is represented in Fig. 8.2.

We first note that the right side of relation (8.1.1) is a particular output
of this type of network, see Fig. 8.2. The following result shows that the
network approximates continuous functions by stair functions.















Proposition 8.2.1 For any function $g \in C[0,1]$ and any $\epsilon > 0$, there is a
simple function $c(x) = \sum_{i=0}^{N-1} c_i H(x - x_i)$ such that

\[ |g(x) - c(x)| < \epsilon, \quad \forall x \in [0,1]. \]


Proof: Fix $\epsilon > 0$ and let $\delta > 0$ be given by the uniform continuity
property of the function $g$ in the interval $[0,1]$. Consider an equidistant
partition $0 = x_0 < x_1 < \cdots < x_N = 1$ with $N$ large enough such that
$\frac{1}{N} < \delta$. Choose $u$ in $[0,1]$, fixed. Then there is a $k < N$ such that
$u \in [x_k, x_{k+1})$. By the uniform continuity of $g$, we have

\[ |g(u) - g(x_k)| < \epsilon. \tag{8.2.2} \]

Construct the real numbers $c_0, c_1, \dots, c_{N-1}$ such that

\[
\begin{aligned}
g(x_0) &= c_0 \\
g(x_1) &= c_0 + c_1 \\
g(x_2) &= c_0 + c_1 + c_2 \\
&\ \ \vdots \\
g(x_{N-1}) &= c_0 + c_1 + \cdots + c_{N-1}.
\end{aligned}
\]

For the previous $u \in [x_k, x_{k+1})$ the Heaviside function evaluates as

\[
H(u - x_j) =
\begin{cases}
1, & \text{if } j \le k \\
0, & \text{if } j > k.
\end{cases}
\]

The value at $u$ of the simple function is

\[ c(u) = \sum_{i=0}^{N-1} c_i H(u - x_i) = \sum_{j=0}^{k} c_j = g(x_k), \]

a fact which implies

\[ |g(x_k) - c(u)| = 0. \tag{8.2.3} \]

Now, the triangle inequality together with relations (8.2.2) and (8.2.3)
provides the inequality

\[ |g(u) - c(u)| \le |g(u) - g(x_k)| + |g(x_k) - c(u)| < \epsilon. \]

Since $u$ was chosen arbitrarily, the previous inequality holds for any
$x \in [0,1]$, which leads to the desired conclusion. ■








Remark 8.2.2 In the following we shall provide a few remarks:

1. The simple function $c(x)$ is the output function of a one-hidden layer neural
network with $N$ computing units in the hidden layer. Each unit is a classical
perceptron with bias $\theta_i = -x_i$. The weight from the input to the hidden
layer is $w = 1$. The weights from the hidden to the output layer are given by
the coefficients $\{c_i\}$.

2. The weights $c_i$ can be constructed as follows. Divide the interval
$[0,1]$ into an equidistant partition

\[ 0 = x_0 < x_1 < \cdots < x_N = 1 \]

and define the weights

\[
\begin{aligned}
c_0 &= g(0) \\
c_1 &= g(x_1) - g(0) \\
c_2 &= g(x_2) - g(x_1) \\
&\ \ \vdots \\
c_{N-1} &= g(x_{N-1}) - g(x_{N-2}).
\end{aligned}
\]

For $N$ large enough (such that $\frac{1}{N} < \delta$) the aforementioned relations
produce weights such that $|c(x) - g(x)| < \epsilon$.

3. From the uniform continuity of $g$, for any $\epsilon > 0$, there is a number $N$
large enough such that $|g(x_{k+1}) - g(x_k)| < \epsilon$ for any $k$. Using the
definition of $c_k$, this means that $|c_k| < \epsilon$ for any $k$, if the partition
is fine enough. Hence, the approximator $c(x) = \sum_{i=0}^{N-1} c_i H(x - x_i)$
given by the proposition can be assumed to satisfy $|c_k| < \epsilon$, $\forall k$,
i.e., to have arbitrarily small step variances. This property will be useful
later. We note that in order to have the step variance small, the price paid is
the large number of computing units in the hidden layer, $N$.
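The construction of the weights in Remark 8.2.2 translates directly into code. The following sketch builds the stair approximator for a hypothetical target $g(x) = \sin(2\pi x)$ with $N = 200$ hidden units and measures the sup-error on a grid; the function name `step_network` and the tolerance are illustrative choices, and $H(0) = 1$ is assumed.

```python
import math

def heaviside(t: float) -> float:
    return 1.0 if t >= 0 else 0.0

def step_network(g, n_units: int):
    """Perceptron network c(x) = sum_i c_i * H(x - x_i) with x_i = i / n_units.

    The output weights are c_0 = g(0) and c_i = g(x_i) - g(x_{i-1}).
    """
    knots = [i / n_units for i in range(n_units)]
    values = [g(x) for x in knots]
    coeffs = [values[0]] + [values[i] - values[i - 1] for i in range(1, n_units)]

    def network(x: float) -> float:
        return sum(c * heaviside(x - xi) for c, xi in zip(coeffs, knots))

    return network, coeffs

def target(x: float) -> float:
    return math.sin(2.0 * math.pi * x)

network, coeffs = step_network(target, 200)
grid = [i / 2000 for i in range(2001)]
sup_error = max(abs(target(x) - network(x)) for x in grid)
max_step = max(abs(c) for c in coeffs[1:])
# Both quantities are of order 2 * pi / N for this target.
print(sup_error, max_step)
```

Since the target is Lipschitz with constant $2\pi$, both the sup-error and the step sizes $|c_k|$, $k \ge 1$, are expected to be at most about $2\pi/N$, which shrinks as more hidden units are used.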


8.3 One-Hidden Layer Sigmoid Network 

Consider a one-hidden layer neural network with a logistic sigmoid activation
function, or more generally, with any activation function satisfying the
hypothesis of Proposition 8.1.1, for the hidden layer and a linear activation
function for the output layer, see Fig. 8.3. We shall show that this network
can potentially learn any continuous function $g \in C[0,1]$.

The main result is given in the following.








Figure 8.3: A one-hidden layer sigmoid neural network, with input to hidden
layer weights $w$ and hidden layer to output weights $c_i$.


Theorem 8.3.1 Consider an activation function $\varphi$ satisfying the hypothesis of
Proposition 8.1.1 and let $g \in C[0,1]$. Then $\forall \epsilon > 0$ there are
$c_i, w, \theta_i \in \mathbb{R}$, and $N > 0$ such that

\[
\Big| g(x) - \sum_{i=1}^{N} c_i \varphi(w x + \theta_i) \Big| < \epsilon, \quad \forall x \in [0,1].
\]


Proof: Since the proof is relatively involved, for the sake of simplicity we
shall divide it into several steps:

Step 1: Approximating the function $g$ by a step function.

By Proposition 8.2.1 there is a step function $c(x)$ such that

\[ |g(x) - c(x)| < \frac{\epsilon}{2}, \quad \forall x \in [0,1]. \tag{8.3.4} \]

By remark 3 of Proposition 8.2.1 we may further assume that the coefficients
of $c(x)$ satisfy $|c_k| < \frac{\epsilon}{4}$ for all $k$.

Step 2: Construct a "smoother" of the step function $c(x)$ using convolution.

Define the function family $\varphi_\alpha(x) = \varphi\big(\frac{x}{\alpha}\big)$, with
$\alpha > 0$. Consider the bump function
$\psi_\alpha(x) = \varphi'_\alpha(x) = \frac{1}{\alpha}\varphi'\big(\frac{x}{\alpha}\big)$.
By Proposition 8.1.1, $\psi_\alpha(x)$ is the density of a probability measure,
$\mu_\alpha$, which converges weakly to the Dirac measure $\delta(x)$, as
$\alpha \to 0$. Define the "smoother" of $c(x)$ by the convolution

\[ c_\alpha(x) = (c * \psi_\alpha)(x). \]

The geometrical flavor is depicted in Fig. 8.4.












Figure 8.4: The step function $c(x)$ and its "smoother"
$c_\alpha(x) = (c * \psi_\alpha)(x)$.


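We shall show that the smoother $c_\alpha(x)$ is of the form

\[ c_\alpha(x) = \sum_{i=1}^{N} c_i \varphi(w x + \theta_i) \]

for some $c_i, w, \theta_i \in \mathbb{R}$.

Using the convolution properties and the facts introduced in section 8.1, we
have, in the sense of derivatives in the generalized sense:

\[
\begin{aligned}
c_\alpha(x) &= (c * \psi_\alpha)(x) = (c * \varphi'_\alpha)(x) = (c' * \varphi_\alpha)(x) \\
&= \Big( \frac{d}{dx} \sum_{i=1}^{N} c_i H(x - x_i) \Big) * \varphi_\alpha(x) \\
&= \Big( \sum_{i=1}^{N} c_i \delta(x - x_i) \Big) * \varphi_\alpha(x)
 = \sum_{i=1}^{N} c_i (\delta_{x_i} * \varphi_\alpha)(x) \\
&= \sum_{i=1}^{N} c_i \varphi_\alpha(x - x_i)
 = \sum_{i=1}^{N} c_i \varphi\Big( \frac{x - x_i}{\alpha} \Big) \\
&= \sum_{i=1}^{N} c_i \varphi(w x + \theta_i),
\end{aligned}
\]

with weight $w = \frac{1}{\alpha}$ and biases $\theta_i = -\frac{x_i}{\alpha}$.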

Step 3: Approximating the step function by a smoother.

We shall show that $\forall \epsilon > 0$ we have

\[ |c(x) - c_\alpha(x)| < \frac{\epsilon}{2}, \quad \forall x \in \mathbb{R}, \tag{8.3.5} \]

for $\alpha$ small enough. Actually, this states the continuity of the functional
$\alpha \to c_\alpha$ at $\alpha = 0$. To prove this, we evaluate the difference and
split it into a sum of three integrals

\[
\begin{aligned}
c(x) - c_\alpha(x) &= c(x) \int_{-\infty}^{\infty} \psi_\alpha(y)\,dy - \int_{-\infty}^{\infty} \psi_\alpha(y)\, c(x - y)\,dy \\
&= \int_{-\infty}^{\infty} \psi_\alpha(y)\big(c(x) - c(x - y)\big)\,dy = I_1 + I_2 + I_3,
\end{aligned}
\tag{8.3.6}
\]

where

\[
\begin{aligned}
I_1 &= \int_{-\infty}^{-\epsilon'} \psi_\alpha(y)\big(c(x) - c(x - y)\big)\,dy \\
I_2 &= \int_{-\epsilon'}^{\epsilon'} \psi_\alpha(y)\big(c(x) - c(x - y)\big)\,dy \\
I_3 &= \int_{\epsilon'}^{\infty} \psi_\alpha(y)\big(c(x) - c(x - y)\big)\,dy,
\end{aligned}
\]

for some $\epsilon' > 0$, to be specified later.

Note that $c(x)$ is bounded, with

\[ |c(x)| = \Big| \sum_{i=1}^{N} c_i H(x - x_i) \Big| \le \sum_{i=1}^{N} |c_i| = M, \]

and hence $|c(x) - c(x - y)| \le |c(x)| + |c(x - y)| \le 2M$. Also, since
$\psi_\alpha$ tends to $\delta(x)$ as $\alpha \to 0$, $\psi_\alpha$ tends to zero on each
of the fixed intervals $(-\infty, -\epsilon')$ and $(\epsilon', \infty)$. By the
Dominated Convergence Theorem, see Theorem C.4.2, $I_1 \to 0$ and $I_3 \to 0$ as
$\alpha \to 0$, so

\[ |I_1 + I_3| < \frac{\epsilon}{4} \]

for $\alpha$ small enough.

To evaluate $I_2$ we note that the graph of $c(x - y)$ is obtained from the graph
of $c(x)$ by a horizontal shift of at most $\epsilon'$, see Fig. 8.5 a. For
$\epsilon'$ small enough there is only one bump of the difference
$c(x) - c(x - y)$ contained in the interval $(-\epsilon', \epsilon')$, see
Fig. 8.5 b. The height of the bump is equal to one of the coefficients $c_k$.
Using the property of the coefficients developed in Step 1, we have the
estimate

\[ |c(x) - c(x - y)| \le |c_k| < \frac{\epsilon}{4} \]

for any $y \in (-\epsilon', \epsilon')$.

Figure 8.5: a. The step functions $c(x)$ and $c(x - y)$ for $|y| < \epsilon'$.
b. The graph of the difference $c(x) - c(x - y)$.

Using the properties of $\psi_\alpha$ we can now estimate the second integral:

\[
|I_2| \le \int_{-\epsilon'}^{\epsilon'} \psi_\alpha(y)\, |c(x) - c(x - y)|\,dy
< \frac{\epsilon}{4} \int_{-\epsilon'}^{\epsilon'} \psi_\alpha(y)\,dy
\le \frac{\epsilon}{4} \int_{-\infty}^{\infty} \psi_\alpha(y)\,dy = \frac{\epsilon}{4}.
\]

To conclude Step 3, we have

\[ |I_1 + I_2 + I_3| \le \frac{\epsilon}{4} + \frac{\epsilon}{4} = \frac{\epsilon}{2} \]

for $\alpha$ and $\epsilon'$ small enough. This implies inequality (8.3.5).

Step 4: Combining all the previous steps and finishing the proof.

The triangle inequality, together with Step 1 and Step 3, provides

\[
|g(x) - c_\alpha(x)| \le |g(x) - c(x)| + |c(x) - c_\alpha(x)|
< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.
\]

By Step 2 the smoother $c_\alpha(x)$ is of the form
$G(x) = \sum_{i=1}^{N} c_i \varphi(w x + \theta_i)$, which ends the proof.

Remark 8.3.2 (i) It is worth noting that in the one-dimensional case the
weight $w$ does not depend on the index $i$.

(ii) The output of the one-hidden layer sigmoid neural network,

\[ y = \sum_{i=1}^{N} c_i \varphi(w x + \theta_i), \]

is obtained by filtering the output of the one-hidden layer perceptron network,
$y = \sum_{i=1}^{N} c_i H(x - x_i)$, through the filter given by the bump function
$\varphi'_\alpha(x)$.

(iii) Both results proved in the last two sections state that any continuous
function on $[0,1]$ can be learned in two ways: Proposition 8.2.1 assures that
the learning can be done by simple functions, while Theorem 8.3.1 assures it
can be done by smooth functions. However, the latter result is more useful,
since the approximator is smooth, a fact that makes possible the application
of the backpropagation algorithm.

Figure 8.6: The polygonal line $g_\epsilon(x)$ converges uniformly on $[a, b]$ to the
continuous function $g(x)$, as $\epsilon \to 0$.

8.4 Learning with ReLU Functions 

We start with a result regarding the polygonal approximation of continuous
functions on a compact interval $[a, b]$, see Fig. 8.6.

Lemma 8.4.1 Let $g : [a, b] \to \mathbb{R}$ be a continuous function. Then
$\forall \epsilon > 0$ there is an equidistant partition

\[ a = x_0 < x_1 < \cdots < x_N = b \]

such that the piecewise linear function $g_\epsilon : [a, b] \to \mathbb{R}$, which passes
through the points $(x_i, g(x_i))$, $i = 0, \dots, N$, satisfies

\[ |g(x) - g_\epsilon(x)| < \epsilon, \quad \forall x \in [a, b]. \]


Proof: We shall perform the proof in two steps.

Step 1. Define the partition and the piecewise linear function.

Let $\epsilon > 0$ be arbitrarily fixed. Since $g$ is uniformly continuous on
$[a, b]$, there is a $\delta > 0$ such that if $|x' - x''| < \delta$, then
$|g(x') - g(x'')| < \epsilon/2$. Consider $N$ large enough such that
$\frac{b - a}{N} < \delta$ and define the equidistant partition
$x_j = a + \frac{j(b - a)}{N}$, $j = 0, \dots, N$. The piecewise linear function is
given by

\[ g_\epsilon(x) = g(x_{i-1}) + m_i (x - x_{i-1}), \quad \forall x \in [x_{i-1}, x_i], \]

with the slope $m_i = \dfrac{g(x_i) - g(x_{i-1})}{x_i - x_{i-1}}$.

Step 2. Finishing the proof by applying the triangle inequality.

Let $x \in [a, b]$ be fixed. Then there is a partition interval that contains $x$,
i.e., $x \in [x_{i-1}, x_i]$. By the uniform continuity of $g$, we have

\[ |g(x) - g(x_i)| < \frac{\epsilon}{2}. \tag{8.4.7} \]

Using that $g_\epsilon$ is affine on the interval $[x_{i-1}, x_i]$, we have the
estimate

\[
|g_\epsilon(x_i) - g_\epsilon(x)| \le |g_\epsilon(x_i) - g_\epsilon(x_{i-1})| = |g(x_i) - g(x_{i-1})| < \frac{\epsilon}{2},
\tag{8.4.8}
\]

where the last inequality used the uniform continuity of $g$. Now, the triangle
inequality together with relations (8.4.7) and (8.4.8) yields

\[
\begin{aligned}
|g(x) - g_\epsilon(x)| &\le |g(x) - g(x_i)| + |g(x_i) - g_\epsilon(x)| \\
&= |g(x) - g(x_i)| + |g_\epsilon(x_i) - g_\epsilon(x)| \\
&< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.
\end{aligned}
\]

Since this inequality is satisfied by any $x \in [a, b]$, the lemma is proved. ■


Remark 8.4.2 1. The result can be reformulated by saying that any continuous
function on a compact interval is the uniform limit of a sequence of
piecewise linear functions.

2. A variant of this result in the case of a non-equidistant partition has been
treated in Example 7.1.1.

Recall the definition of a ReLU function in terms of the Heaviside function
as

\[
ReLU(x) = x H(x) =
\begin{cases}
x, & \text{if } x \ge 0 \\
0, & \text{if } x < 0.
\end{cases}
\tag{8.4.9}
\]

Its generalized derivative is given by

\[ ReLU'(x) = (x H(x))' = x' H(x) + x H'(x) = H(x) + x \delta(x) = H(x), \]

since $x\delta(x) = 0$,¹ as a product of functions with disjoint supports. Another
verification of this statement, directly from the definition, can be found in
Example F.2.2 of the Appendix.

Figure 8.7: One-hidden layer neural network with ReLU activation functions.

The next result shows that a neural network with a ReLU activation
function can approximate continuous functions on a compact set, see Fig. 8.7.

Theorem 8.4.3 Consider a one-hidden layer neural network with ReLU activation
functions for the hidden layer neurons. Consider the output neuron
having the identity activation function and the bias $\beta$. Let $g \in C[0,1]$ be
given. Then $\forall \epsilon > 0$ there are $\alpha_j, \theta_j \in \mathbb{R}$, and $N \ge 1$
such that

\[
\Big| g(x) - \sum_{j=0}^{N-1} \alpha_j ReLU(x + \theta_j) - \beta \Big| < \epsilon, \quad \forall x \in [0,1].
\]


Proof: Consider an equidistant partition $(x_i)_i$ of the interval $[0,1]$

\[ 0 = x_0 < x_1 < \cdots < x_N = 1 \]

as given by Lemma 8.4.1. We need to show that the parameters $\alpha_j$, $\theta_j$
and $\beta$ can be chosen such that the approximating function

\[ G(x) = \sum_{j=0}^{N-1} \alpha_j ReLU(x + \theta_j) + \beta \]

becomes a piecewise linear function as in Lemma 8.4.1. We start by setting
the biases $\theta_j = -x_j$ and noting that

\[
ReLU(x_k - x_j) =
\begin{cases}
0, & \text{if } k \le j \\
\frac{k - j}{N}, & \text{if } k > j,
\end{cases}
\]

where we have taken into account formula (8.4.9). Therefore, the value of
$G(x)$ at the partition point $x_k$ is given by

\[
G(x_k) = \sum_{j=0}^{N-1} \alpha_j ReLU(x_k - x_j) + \beta
= \frac{1}{N} \sum_{j < k} \alpha_j (k - j) + \beta.
\]

¹ In general, the product between a function $f(x)$ and the Dirac measure
$\delta(x)$ is $f(x)\delta(x) = f(0)\delta(x)$.

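The $N + 1$ parameters, $\alpha_j$ and $\beta$, are uniquely obtained from the
constraints $G(x_k) = g(x_k)$, $k = 0, \dots, N$. Denoting for simplicity
$y_k = g(x_k)$, the aforementioned constraints are written as

\[
\begin{aligned}
y_0 &= G(x_0) = G(0) = \beta \\
y_1 &= G(x_1) = \alpha_0 (1 - 0)/N + \beta \\
y_2 &= G(x_2) = \alpha_0 (2 - 0)/N + \alpha_1 (2 - 1)/N + \beta \\
y_3 &= G(x_3) = \alpha_0 (3 - 0)/N + \alpha_1 (3 - 1)/N + \alpha_2 (3 - 2)/N + \beta \\
&\ \ \vdots \\
y_N &= G(x_N) = \alpha_0 (N - 0)/N + \cdots + \alpha_{N-1}/N + \beta.
\end{aligned}
\]

The system is equivalent to

\[
\begin{aligned}
y_0 &= \beta \\
N(y_1 - y_0) &= \alpha_0 \\
N(y_2 - y_0) &= 2\alpha_0 + \alpha_1 \\
N(y_3 - y_0) &= 3\alpha_0 + 2\alpha_1 + \alpha_2 \\
&\ \ \vdots \\
N(y_N - y_0) &= N\alpha_0 + (N - 1)\alpha_1 + \cdots + \alpha_{N-1}.
\end{aligned}
\]

Subtracting consecutive equations yields

\[
\begin{aligned}
\beta &= y_0 \\
\alpha_0 &= N(y_1 - y_0) \\
\alpha_0 + \alpha_1 &= N(y_2 - y_1) \\
\alpha_0 + \alpha_1 + \alpha_2 &= N(y_3 - y_2) \\
&\ \ \vdots \\
\alpha_0 + \alpha_1 + \cdots + \alpha_{N-1} &= N(y_N - y_{N-1}).
\end{aligned}
\]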









The system has the unique solution

\[
\begin{aligned}
\beta &= y_0 \\
\alpha_0 &= N(y_1 - y_0) \\
\alpha_1 &= N(y_2 - 2y_1 + y_0) \\
\alpha_2 &= N(y_3 - 2y_2 + y_1) \\
&\ \ \vdots \\
\alpha_{N-1} &= N(y_N - 2y_{N-1} + y_{N-2}).
\end{aligned}
\]

The approximator function

\[ G(x) = \sum_{j=0}^{N-1} \alpha_j ReLU(x - x_j) + \beta \]

is piecewise linear, since its derivative

\[ G'(x) = \sum_{j=0}^{N-1} \alpha_j ReLU'(x - x_j) = \sum_{j=0}^{N-1} \alpha_j H(x - x_j) \]

is constant on each interval $(x_i, x_{i+1})$, given by a partial sum of the
$\alpha_j$. Applying Lemma 8.4.1 and choosing the polygonal line
$G(x) = g_\epsilon(x)$, we obtain $|g(x) - G(x)| < \epsilon$ for all $x \in [0,1]$, for
any a priori arbitrarily fixed $\epsilon > 0$. ■
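The closed-form solution for $\beta$ and $\alpha_j$ gives an explicit recipe for the ReLU network. The sketch below implements it for a hypothetical target $g(x) = \sin(2\pi x)$ and $N = 50$, and checks that $G$ interpolates $g$ at the partition points; the function name `relu_network` is an illustrative choice.

```python
import math

def relu(t: float) -> float:
    return t if t > 0 else 0.0

def relu_network(g, n: int):
    """Build G(x) = sum_j alpha_j * ReLU(x - x_j) + beta on the grid x_j = j / n.

    beta = y_0, alpha_0 = n * (y_1 - y_0),
    alpha_j = n * (y_{j+1} - 2 * y_j + y_{j-1}) for 1 <= j <= n - 1.
    """
    ys = [g(k / n) for k in range(n + 1)]
    beta = ys[0]
    alphas = [n * (ys[1] - ys[0])]
    alphas += [n * (ys[j + 1] - 2.0 * ys[j] + ys[j - 1]) for j in range(1, n)]
    knots = [j / n for j in range(n)]

    def network(x: float) -> float:
        return beta + sum(a * relu(x - xj) for a, xj in zip(alphas, knots))

    return network

def target(x: float) -> float:
    return math.sin(2.0 * math.pi * x)

n = 50
network = relu_network(target, n)
node_error = max(abs(network(k / n) - target(k / n)) for k in range(n + 1))
sup_error = max(abs(network(i / 5000) - target(i / 5000)) for i in range(5001))
print(node_error, sup_error)
```

For a twice differentiable target the error of linear interpolation is at most $h^2 \max|g''| / 8$ with $h = 1/N$, which is about $0.002$ in this example; this is a standard interpolation estimate and not part of the text above.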


Example 8.4.4 (Hedging application) Let $S$ denote the price of a stock and recall that the payoff of a European call with strike price $K$ is given by $C(S) = \mathrm{ReLU}(S - K)$. Then any portfolio value $V(S)$, which is a continuous function of $S$, for $0 \le S \le S_{\max}$, can be learned by a one-hidden layer neural network, which has the output

\[
G(S) = \sum_{j=0}^{N-1} \alpha_j \mathrm{ReLU}(S - K_j) + \beta = \sum_{j=0}^{N-1} \alpha_j C_j(S) + \beta,
\]

where $C_j(S) = \mathrm{ReLU}(S - K_j)$ is the payoff of the call with strike price $K_j$. This can be constructed by buying $\alpha_j$ units of calls with strike price $K_j$, for $j = 0, \dots, N - 1$, and keeping an amount $\beta$ in bonds. In order to hedge the portfolio position a trader would have to take the opposite market position, namely, to sell the newly formed portfolio $G(S)$. The conclusion is that any portfolio of stocks can be replicated with a portfolio containing calls and bonds.
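As an illustration of the replication idea, the sketch below computes call weights for strikes that need not be equally spaced: each weight is the change of slope of the target payoff at the corresponding strike. The strikes, the target payoff $|S - 100|$, and all function names are hypothetical; a negative weight is read here as selling calls rather than buying them.

```python
def relu(t):
    return t if t > 0.0 else 0.0

def call_weights(strikes, values):
    """strikes: increasing K_0 < ... < K_N; values: target payoff at each strike.

    Returns (alpha, beta) for G(S) = sum_j alpha_j * ReLU(S - K_j) + beta,
    j = 0..N-1, which is linear between strikes and matches `values` at them.
    """
    slopes = [
        (values[j + 1] - values[j]) / (strikes[j + 1] - strikes[j])
        for j in range(len(strikes) - 1)
    ]
    # alpha_0 is the first slope; alpha_j is the slope change at K_j.
    alpha = [slopes[0]] + [slopes[j] - slopes[j - 1] for j in range(1, len(slopes))]
    beta = values[0]  # amount kept in bonds
    return alpha, beta

def portfolio(S, strikes, alpha, beta):
    # zip stops at the last weight, so only K_0..K_{N-1} are used.
    return sum(a * relu(S - K) for a, K in zip(alpha, strikes)) + beta

strikes = [0.0, 50.0, 100.0, 150.0, 200.0]      # hypothetical strikes
values = [abs(K - 100.0) for K in strikes]       # hypothetical target payoff
alpha, beta = call_weights(strikes, values)
max_gap = max(
    abs(portfolio(float(S), strikes, alpha, beta) - abs(S - 100.0))
    for S in range(0, 201, 5)
)
```

For equally spaced strikes $K_j = j/N$ this reduces to the second-difference formulas of Theorem 8.4.3.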






8.5 Learning with Softplus 

The softplus activation function

\[
\varphi(x) = \mathrm{sp}(x) = \ln(1 + e^x)
\]

is a smoothed version of the $\mathrm{ReLU}(x)$ function, see Fig. 2.7 a, as the next result shows.

Proposition 8.5.1 The softplus function is given by the convolution

\[
\mathrm{sp}(x) = \int_{-\infty}^{\infty} \mathrm{ReLU}(y) K(x - y)\, dy,
\]

with the convolution kernel

\[
K(x) = \frac{1}{(1 + e^x)(1 + e^{-x})}.
\]


Proof: We note that the kernel $K(x)$ decreases to zero exponentially fast as $x \to \pm\infty$, i.e., $\lim_{x \to \pm\infty} K(x) = 0$. This condition suffices for the existence of the next improper integrals. We also note the following relation between the kernel and the logistic function $\sigma$:

\[
\sigma'(x) = \sigma(x)(1 - \sigma(x)) = \frac{1}{1 + e^{-x}} \Big( 1 - \frac{1}{1 + e^{-x}} \Big) = \frac{1}{(1 + e^x)(1 + e^{-x})} = K(x).
\]


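The convolution can now be computed, using the substitution $t = x - y$, as

\begin{align*}
\int_{-\infty}^{\infty} \mathrm{ReLU}(y) K(x - y)\, dy &= \int_0^{\infty} y K(x - y)\, dy \\
&= -\int_x^{-\infty} (x - t) K(t)\, dt = \int_{-\infty}^{x} (x - t) K(t)\, dt \\
&= x \int_{-\infty}^{x} K(t)\, dt - \int_{-\infty}^{x} t K(t)\, dt \\
&= x \int_{-\infty}^{x} \sigma'(t)\, dt - \int_{-\infty}^{x} t \sigma'(t)\, dt \\
&= x \sigma(x) - \Big( t \sigma(t) \Big|_{-\infty}^{x} - \int_{-\infty}^{x} t' \sigma(t)\, dt \Big) \\
&= \int_{-\infty}^{x} \sigma(t)\, dt = \mathrm{sp}(x),
\end{align*}
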

which proves the desired relation. We note that in the last identity we used 
formula (2.1.2). ■ 
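The identity of Proposition 8.5.1 can also be checked numerically. The following is a minimal sketch, assuming a plain trapezoidal rule on a truncated interval; the truncation length `tail` and the number of steps `n` are ad hoc choices.

```python
import math

def softplus(x):
    # ln(1 + e^x), written so that large |x| does not overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def kernel(t):
    # K(t) = 1 / ((1 + e^t)(1 + e^-t)) = e^-|t| / (1 + e^-|t|)^2
    e = math.exp(-abs(t))
    return e / (1.0 + e) ** 2

def relu_conv_kernel(x, tail=40.0, n=20000):
    """Trapezoidal approximation of the integral of ReLU(y) K(x - y) dy.

    ReLU(y) vanishes for y < 0, and K(x - y) is negligible for y > b.
    """
    b = max(x, 0.0) + tail
    h = b / n
    total = 0.5 * (0.0 * kernel(x) + b * kernel(x - b))  # endpoint terms
    for i in range(1, n):
        y = i * h
        total += y * kernel(x - y)
    return total * h

errors = [abs(relu_conv_kernel(x) - softplus(x)) for x in (-3.0, 0.0, 2.5)]
```

The discrepancies collected in `errors` are expected to be at the level of the quadrature error, which is consistent with the proposition.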










Figure 8.8: The convolutional kernels $K_\alpha(x)$ converge to a “spike” centered at zero, as $\alpha \to 0$.


Remark 8.5.2 An informal verification of the previous convolution formula is to compute the generalized derivative of the function

\[
f(x) = (\mathrm{ReLU} * K)(x) - \mathrm{sp}(x).
\]

Using convolution properties, as well as the derivation formulas

\[
\mathrm{ReLU}'(x) = H(x), \qquad H'(x) = \delta(x), \qquad \mathrm{sp}'(x) = \sigma(x),
\]

we obtain

\begin{align*}
f'(x) &= (\mathrm{ReLU} * K)'(x) - \mathrm{sp}'(x) = (\mathrm{ReLU}' * K)(x) - \sigma(x) \\
&= (H * \sigma')(x) - \sigma(x) = (H' * \sigma)(x) - \sigma(x) \\
&= (\delta * \sigma)(x) - \sigma(x) = \sigma(x) - \sigma(x) = 0,
\end{align*}

where $H(x)$ and $\delta(x)$ stand for the Heaviside function and Dirac's measure. Hence, $f(x)$ is piecewise constant. Since it is continuous (as a difference of two continuous functions), it follows that $f(x)$ is constant. Then we make the point that $\lim_{x \to -\infty} f(x) = 0$, which shows that $f(x) = 0$ for all real numbers $x$.

The properties of the convolution kernel $K(x)$ are given in the next result, see Fig. 8.8.


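Proposition 8.5.3 (i) The kernel $K(x)$ is a symmetric probability density function on $\mathbb{R}$.

(ii) Consider the scaled perturbation $K_\alpha(x) = \frac{1}{\alpha} K\big(\frac{x}{\alpha}\big)$, with $\alpha > 0$, and the measure $\mu_\alpha$ with density $K_\alpha$, i.e., $d\mu_\alpha(x) = K_\alpha(x)\, dx$. Then

\[
\int_{-\infty}^{\infty} K_\alpha(x)\, dx = 1
\]

and $\mu_\alpha \to \delta$, as $\alpha \to 0$, in the weak sense.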








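Proof: (i) It is obvious that $K(-x) = K(x)$ and $K(x) > 0$. We also have

\[
\int_{-\infty}^{\infty} K(x)\, dx = \int_{-\infty}^{\infty} \sigma'(x)\, dx = \sigma(+\infty) - \sigma(-\infty) = 1.
\]

(ii) Using the substitution $x = \alpha y$ and part (i), we have

\[
\int_{-\infty}^{\infty} K_\alpha(x)\, dx = \frac{1}{\alpha} \int_{-\infty}^{\infty} K\Big(\frac{x}{\alpha}\Big)\, dx = \int_{-\infty}^{\infty} K(y)\, dy = 1.
\]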


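For any function with compact support, $g \in C_0^\infty(\mathbb{R})$, the Bounded Convergence Theorem, Theorem C.4.3, yields

\begin{align*}
\int_{-\infty}^{\infty} g(x)\, d\mu_\alpha(x) &= \int_{-\infty}^{\infty} K_\alpha(x) g(x)\, dx = \int_{-\infty}^{\infty} K(y) g(\alpha y)\, dy \\
&\to \int_{-\infty}^{\infty} K(y) g(0)\, dy = g(0) = \int_{-\infty}^{\infty} g(x) \delta(x)\, dx = \delta(g), \quad \text{as } \alpha \to 0,
\end{align*}

which means that $\mu_\alpha \to \delta$, as $\alpha \to 0$, in the weak sense.

We consider now the scaled softplus function

\[
\varphi_\alpha(x) = \alpha\, \mathrm{sp}\Big(\frac{x}{\alpha}\Big) = \alpha \ln(1 + e^{x/\alpha})
\]

and note that

\[
\varphi_\alpha''(x) = \frac{1}{\alpha}\, \mathrm{sp}''\Big(\frac{x}{\alpha}\Big) = \frac{1}{\alpha} K\Big(\frac{x}{\alpha}\Big) = K_\alpha(x), \tag{8.5.10}
\]

since $\mathrm{sp}'(x) = \sigma(x)$ and $\sigma'(x) = K(x)$.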

It is worth noting that the scaled softplus functions $\varphi_\alpha(x)$ converge pointwise to $\mathrm{ReLU}(x)$, in a decreasing way, see Fig. 8.9. This means

\[
\varphi_\alpha(x) \searrow \mathrm{ReLU}(x), \quad \text{as } \alpha \searrow 0.
\]
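The monotone convergence can be observed numerically. Writing $\varphi_\alpha(x) - \mathrm{ReLU}(x) = \alpha \ln(1 + e^{-|x|/\alpha})$ shows that the gap is largest at $x = 0$, where it equals $\alpha \ln 2$. The sketch below checks this on a grid; the grid and the list of values of $\alpha$ (taken from Fig. 8.9) are illustrative choices.

```python
import math

def relu(x):
    return max(x, 0.0)

def scaled_softplus(x, alpha):
    # alpha * ln(1 + e^{x/alpha}), in an overflow-safe form
    return max(x, 0.0) + alpha * math.log1p(math.exp(-abs(x) / alpha))

alphas = [1.0, 0.75, 0.5, 0.25]
grid = [i / 100.0 for i in range(-300, 301)]

# Largest gap between the scaled softplus and ReLU on the grid, per alpha.
gaps = [max(scaled_softplus(x, a) - relu(x) for x in grid) for a in alphas]

# Pointwise ordering: smaller alpha gives a smaller function, never below ReLU.
decreasing = all(
    relu(x) <= scaled_softplus(x, 0.25) <= scaled_softplus(x, 0.5)
    <= scaled_softplus(x, 0.75) <= scaled_softplus(x, 1.0)
    for x in grid
)
```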

The following result provides a formula for the smoothed polygonal function $G(x)$ through the filter $K_\alpha$, which is defined by (8.5.10).

Lemma 8.5.4 Consider the convolution $G_\alpha = G * K_\alpha$, where

\[
G(x) = \sum_{j=0}^{N-1} \alpha_j \mathrm{ReLU}(x - x_j) + \beta.
\]

There are parameters $c_j$, $w$, and $\theta_j$, depending on $\alpha_j$ and $x_j$, such that

\[
G_\alpha(x) = \sum_{j=0}^{N-1} c_j\, \mathrm{sp}(wx - \theta_j) + \beta.
\]





Figure 8.9: The scaled softplus $\varphi_\alpha(x)$ for values $\alpha = 1, 0.75, 0.5, 0.25$.


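Proof: Using relation (8.5.10) and Proposition 8.5.3, part (ii), we have

\begin{align*}
G_\alpha(x) = (G * K_\alpha)(x) &= \sum_{j=0}^{N-1} \alpha_j \mathrm{ReLU}(x - x_j) * K_\alpha(x) + \beta * K_\alpha(x) \\
&= \sum_{j=0}^{N-1} \alpha_j \mathrm{ReLU}(x - x_j) * \varphi_\alpha''(x) + \beta \\
&= \sum_{j=0}^{N-1} \alpha_j H'(x - x_j) * \varphi_\alpha(x) + \beta \\
&= \sum_{j=0}^{N-1} \alpha_j \delta(x - x_j) * \varphi_\alpha(x) + \beta = \sum_{j=0}^{N-1} \alpha_j \varphi_\alpha(x - x_j) + \beta \\
&= \sum_{j=0}^{N-1} \alpha_j \alpha\, \varphi\Big(\frac{x - x_j}{\alpha}\Big) + \beta = \sum_{j=0}^{N-1} c_j\, \mathrm{sp}(wx - \theta_j) + \beta,
\end{align*}

with $c_j = \alpha \alpha_j$, $w = 1/\alpha$, and $\theta_j = x_j/\alpha$. We also used that $\beta * K_\alpha = \beta$, since $K_\alpha$ has the integral equal to 1.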


Remark 8.5.5 The function $G_\alpha$ is the output of a one-hidden layer neural network with softplus activation functions for the hidden layer and identity function for the output neuron. The bias for the output neuron is $\beta$.

The following result is an application of Dini's Theorem.

Lemma 8.5.6 The sequence $G_\alpha$ converges uniformly to $G$ on $[0,1]$, as $\alpha \searrow 0$, i.e., $\forall \epsilon > 0$ there is $\eta > 0$ such that for $\alpha < \eta$, we have

\[
|G(x) - G_\alpha(x)| < \epsilon, \quad \forall x \in [0,1].
\]










Proof: From the previous construction, we have

\[
G_\alpha(x) = \sum_{j=0}^{N-1} \alpha_j \alpha\, \varphi\Big(\frac{x - x_j}{\alpha}\Big) + \beta = \sum_{j=0}^{N-1} \alpha_j \varphi_\alpha(x - x_j) + \beta.
\]

Since $\varphi_\alpha(x) \searrow \mathrm{ReLU}(x)$ as $\alpha \searrow 0$, see Exercise 8.8.6, then $G_\alpha(x)$ converges pointwise and decreasing to $G(x)$, for any $x \in [0,1]$. Since $G_\alpha$ and $G$ are continuous on the compact interval $[0,1]$, an application of Dini's Theorem (Theorem 7.1.1) implies that $G_\alpha$ converges to $G$, uniformly on $[0,1]$. ■
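The uniform convergence can also be seen through an elementary estimate, offered here as a complementary check rather than as the argument of the proof: since $0 \le \varphi_\alpha(t) - \mathrm{ReLU}(t) \le \alpha \ln 2$ for all $t$, the triangle inequality gives $|G_\alpha(x) - G(x)| \le \alpha \ln 2 \sum_j |\alpha_j|$. The sketch below compares the observed gap with this bound; the coefficients, knots, and bias are hypothetical.

```python
import math

def relu(x):
    return max(x, 0.0)

def scaled_softplus(x, alpha):
    # alpha * ln(1 + e^{x/alpha}), in an overflow-safe form
    return max(x, 0.0) + alpha * math.log1p(math.exp(-abs(x) / alpha))

def G(x, coeffs, knots, beta):
    return sum(a * relu(x - k) for a, k in zip(coeffs, knots)) + beta

def G_alpha(x, coeffs, knots, beta, alpha):
    return sum(a * scaled_softplus(x - k, alpha) for a, k in zip(coeffs, knots)) + beta

coeffs = [2.0, -3.0, 1.5, 4.0]     # hypothetical alpha_j (mixed signs)
knots = [0.0, 0.25, 0.5, 0.75]     # hypothetical x_j
beta = 0.1
grid = [i / 1000.0 for i in range(1001)]

def sup_gap(alpha):
    # Largest observed |G_alpha - G| on the grid.
    return max(
        abs(G_alpha(x, coeffs, knots, beta, alpha) - G(x, coeffs, knots, beta))
        for x in grid
    )

def bound(alpha):
    return alpha * math.log(2.0) * sum(abs(a) for a in coeffs)

report = [(a, sup_gap(a), bound(a)) for a in (0.1, 0.01, 0.001)]
```

Each row of `report` lists $\alpha$, the observed gap on the grid, and the bound, which shrinks linearly in $\alpha$.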

The next result is an analog of Theorem 8.4.3 for the case of the softplus function. The idea is to approximate a given continuous function on a compact interval with a polygonal line, and then to “blur” this approximation slightly, using a convolution with a given kernel. This procedure leads to a differentiable approximation of the initial continuous function.

Theorem 8.5.7 Consider a one-hidden layer neural network with softplus activation function for the hidden layer and identity activation function for the outcome neuron. Let $g \in C[0,1]$. Then $\forall \epsilon > 0$ there are $c_j, w, \theta_j, \beta \in \mathbb{R}$ and $N \ge 1$ such that

\[
\Big| g(x) - \sum_{j=0}^{N-1} c_j\, \mathrm{sp}(wx - \theta_j) - \beta \Big| < \epsilon, \quad \forall x \in [0,1].
\]


Proof: Let $\epsilon > 0$ be arbitrary fixed. By Theorem 8.4.3, there is a function $G(x) = \sum_{j=0}^{N-1} \alpha_j \mathrm{ReLU}(x - x_j) + \beta$ such that

\[
|g(x) - G(x)| < \frac{\epsilon}{2}, \quad \forall x \in [0,1].
\]

Lemma 8.5.6 implies that for any $\alpha > 0$, sufficiently small, we have the estimation

\[
|G(x) - G_\alpha(x)| < \frac{\epsilon}{2}, \quad \forall x \in [0,1].
\]

The previous two estimations, together with the triangle inequality, yield

\begin{align*}
|g(x) - G_\alpha(x)| &\le |g(x) - G(x)| + |G(x) - G_\alpha(x)| \\
&< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon, \quad \forall x \in [0,1].
\end{align*}

Since Lemma 8.5.4 states that the function $G_\alpha(x)$ is the output of a one-hidden layer network with softplus activation functions, we obtain the desired relation. ■












Figure 8.10: The approximation of a continuous function $g \in C[0,1]$ by three approximators: $G(x) = \sum_{j=0}^{N-1} \alpha_j \mathrm{ReLU}(x - x_j) + \beta$, $G_\alpha(x) = \sum_{j=0}^{N-1} c_j\, \mathrm{sp}(wx - \theta_j) + \beta$ and $c(x) = \sum_{i=0}^{N-1} c_i H(x - x_i)$.


Remark 8.5.8 We have approximated continuous functions on $[0,1]$ using outputs of networks with different activation functions (Heaviside, ReLU, and Softplus). Now it is time to ask the natural question: Which activation function is better? The following points discuss this question.

(i) The outcome of a one-hidden layer neural network with Softplus activation function is obtained by applying a convolution with kernel $K_\alpha$ to the outcome of a one-hidden layer neural network with ReLU activation function.

(ii) We note the following gain of smoothness: if the network outcome is continuous in the case of using ReLU, then in the case of using Softplus it becomes differentiable.

(iii) The approximation of a continuous function $g \in C[0,1]$ by different neural outcomes can be visualized in Fig. 8.10. The approximation using ReLU is a polygonal approximation denoted by $G(x)$. This is more accurate than the approximation using step functions, $c(x)$. However, the approximation using Softplus, denoted by $G_\alpha(x)$, is even better than $G(x)$. All errors in this case are measured in the sense of the $\infty$-norm.


8.6 Discrete Data 

In real life only some discrete data is provided. The ability to find a continuous function, rather than an arbitrary function, that is arbitrarily close to the given set of data points says something about the ability of one-hidden layer neural networks, viewed as learning machines, to generalize.




















We assume the novel input data consists of $N$ data points, $(x_i, y_i) \in [0,1] \times [a,b]$, $1 \le i \le N$. In order to move forward with our analysis we establish the following ansatz: $y_i$ are responses obtained from $x_i$ by applying a continuous function $g$. In other words, we assume the existence of a continuous function $g : [0,1] \to \mathbb{R}$ such that $g(x_i) = y_i$, $1 \le i \le N$. This mathematical assumption holds true in most cases when data is provided by measuring the cause and the effect in physical phenomena.

Let $\epsilon > 0$ be arbitrarily fixed. It can be inferred from the mathematical results proved in this chapter that the response generated by a one-hidden layer neural network, $f_\epsilon(x)$, has the maximal error from the given data of at most $\epsilon$, namely

\[
\max_i |f_\epsilon(x_i) - y_i| < \epsilon.
\]

This procedure goes beyond lookup tables or memorizing data because it instantiates prior knowledge about how the learning machine should generalize. This means that if the network is provided with some new data, then its output is a response similar to the response to a known input pattern that is close, in a certain distance sense, to the novel input pattern.

8.7 Summary 

This chapter shows that one-hidden layer neural networks with one-dimensional input, $x \in [0,1]$, can learn continuous functions in $C[0,1]$. We prove this result in the cases when the activation function is either a step function, a sigmoid, a ReLU, or a Softplus function.

The outcome of a network having a sigmoid activation function is a smoothed version of the outcome of a network having a step activation function. Similarly, learning with Softplus is a smoothed version of learning with a ReLU function. It is worth noting the mathematical reason behind the experimentally observed fact that the use of sigmoid activation function produces smaller errors than the use of step functions. Similarly, it is often better to use the Softplus activation function rather than ReLU.

The practical importance of the method consists in the ability of the network to generalize well when it is applied to discrete data. The network potentially learns a continuous function which underlies the given data.



Figure 8.11: a. Bipolar step function. b. Box function. c. Sawtooth function. d. Shark-tooth function.

8.8 Exercises 

Exercise 8.8.1 (generalized derivative) We say that $f' = g$ in the generalized sense, if for any compactly supported function, $\phi \in C_0^\infty$, we have

\[
\int f(x) \phi'(x)\, dx = -\int g(x) \phi(x)\, dx.
\]

Show that the following relations hold in the generalized derivative sense:

(a) $H'(x - x_0) = \delta(x - x_0)$;

(b) $\mathrm{ReLU}'(x) = H(x)$;

(c) $\mathrm{ReLU}''(x) = \delta(x)$;

(d) $(x\,\mathrm{ReLU}(x))' = 2\,\mathrm{ReLU}(x)$;

(e) $(x\,\mathrm{ReLU}(x))'' = 2H(x)$.

Exercise 8.8.2 Equation (8.1.1) shows that any simple function can be written as a linear combination of Heaviside functions. Find the coefficients $c_i$ in terms of $\alpha_i$.


















Exercise 8.8.3 (uniform continuity) Consider the continuous function $g : [a,b] \to \mathbb{R}$. Prove that $\forall \epsilon > 0$, $\exists \delta > 0$ such that $\forall x, x' \in [a,b]$ with $|x - x'| < \delta$, we have $|g(x) - g(x')| < \epsilon$.

Exercise 8.8.4 Use Dini's theorem (Theorem 7.1.1) to prove Lemma 8.4.1.

Exercise 8.8.5 Let $f$ be one of the functions depicted in Fig. 8.11. Which one has the property that the set of finite combinations

\[
\Big\{ \sum_{i=1}^{N} \alpha_i f(w_i x - b_i);\ N \ge 1,\ \alpha_i, w_i, b_i \in \mathbb{R} \Big\}
\]

is dense in $C[0,1]$?

Exercise 8.8.6 Let $\varphi_\alpha(x) = \alpha \ln(1 + e^{x/\alpha})$. Show that $\varphi_\alpha(x) \searrow \mathrm{ReLU}(x)$ as $\alpha \searrow 0$.









Chapter 9 

Universal Approximators 


The answer to the question “Why do neural networks work so well in practice?” is certainly based on the fact that neural networks can approximate well a large family of real-life functions that depend on input variables. The goal of this chapter is to provide mathematical proofs of this behavior for different variants of targets.

The fact that neural networks behave as universal approximators means that their output is an approximation, to any desired degree of accuracy, of functions in certain familiar spaces of functions. The process of obtaining an accurate approximation is interpreted as a learning process of a given target function.

The chapter deals with the existence of the learning process, but it does not provide any explicit construction of the network weights. The idea followed here is that it makes sense to know first that a solution exists before we start looking for it.

The desired solution exists, provided that a sufficient number of hidden units are considered and the relation between the input variables and the target function is deterministic. In order to obtain the solution, an adequate learning algorithm should be used. This technique should assure the success of using neural networks in many practical applications.

9.1 Introductory Examples 

A neural network has the capability of learning from data. Given a target function of a certain type, $z = f(x)$, the network has to learn it by changing its internal parameters. Most of the time this learning is just an acceptable approximation. To facilitate understanding, in the following we provide a few examples.

© Springer Nature Switzerland AG 2020
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3_9


Example 9.1.1 (Learning continuous functions) Assume we need to create a neural network which is meant to rediscover Newton's gravitational law. This is

\[
f(m_1, m_2, d) = \frac{k m_1 m_2}{d^2},
\]

i.e., the force between two bodies with masses $m_1$ and $m_2$ is inversely proportional to the square of the distance between them. We construct a network with inputs $(x_1, x_2, x_3) = (m_1, m_2, d)$ that can take continuous values in a certain 3-dimensional region $K$. The target variable, which needs to be learned, is $z = f(m_1, m_2, d)$, with $f : K \to \mathbb{R}$ a continuous function. The network has an output $y = g(m_1, m_2, d; w, b)$, which depends on both inputs and some parameters (weights $w$ and bias $b$). The parameters have to be tuned such that the outcome function, $g$, can approximate to any degree of accuracy the continuous function $f$. Therefore, if one provides two bodies with given masses, situated at a known distance, then the network should be able to provide the gravitational force between them with an error smaller than any a priori fixed error. The existence of such a neural network is given by Theorem 9.3.6.


Example 9.1.2 (Learning finite energy functions) An audio signal needs to be learned by a neural network. It makes sense to consider signals of finite energy only, a condition which will be formalized later by saying that the signal is a function in the space $L^2$. The input to the network is $x = (t, \nu)$ with $x_1 = t$, time, and $x_2 = \nu$, signal frequency. The target function, $z = f(x)$, provides the amplitude in terms of time and frequency. The network's output, $y = g(t, \nu; w, b)$, provides an approximation of the target, which can be adjusted by changing the values of the weights $w$ and bias $b$. The existence of a neural network with this property is provided by Proposition 9.3.11.


9.2 General Setup 

The concept of the neural network as a universal approximator is as follows. Let $x$ be the input variable, $z$ be the target, or desired variable, and denote the target function by $z = f(x)$, with $f$ in a certain function space, $\mathcal{S}$, which is a metric space, endowed with distance $d$, i.e., $f \in (\mathcal{S}, d)$. The space of the neural network outcome functions, $g$, is denoted by $\mathcal{U}$ and is assumed to be a subset of $\mathcal{S}$. In fact, we choose the activation function of the network such that the previous inclusion is satisfied.





Now, we say that the neural network is a universal approximator for the space $(\mathcal{S}, d)$ if the space of outcomes $\mathcal{U}$ is $d$-dense in $\mathcal{S}$, i.e.,

\[
\forall f \in \mathcal{S},\ \forall \epsilon > 0,\ \exists g \in \mathcal{U} \text{ such that } d(f, g) < \epsilon.
\]

This means that for any function $f$ in $\mathcal{S}$, there are functions in $\mathcal{U}$ situated in any proximity of $f$. The process by which the neural network approximates a target function $f \in \mathcal{S}$ by functions in $\mathcal{U}$ is called learning. $\mathcal{U}$ is the approximation space of the target space $\mathcal{S}$.

There are several types of target spaces, $(\mathcal{S}, d)$, which are useful in applications. The most familiar will be presented in the following examples. The reader can find more details regarding the normed vectorial spaces in the Appendix.

Example 9.2.1 Let $\mathcal{S} = C(K)$ be the space of real-valued continuous functions on a compact set $K$. The distance function is given by

\[
d(f, g) = \sup_{x \in K} |f(x) - g(x)|, \quad \forall f, g \in C(K),
\]

which is the maximum absolute value difference between the values of two functions on the compact set. Since $|f - g|$ is continuous, the maximum is reached on $K$, and we may replace in the definition “sup” by “max”. It is worth noting that this distance is induced by the norm $\|f\| = \sup_{x \in K} |f(x)|$, case in which $(C(K), \|\cdot\|)$ becomes a normed vectorial space.

Example 9.2.2 If $\mathcal{S} = L^1(K)$ is the space of real-valued integrable functions on a compact set $K$, the distance function in this case is

\[
d(f, g) = \int_K |f(x) - g(x)|\, dx, \quad \forall f, g \in L^1(K).
\]

This can be interpreted as the total area between the graphs of the functions $f$ and $g$ over the set $K$. The corresponding norm is the $L^1$-norm defined by $\|f\|_1 = \int_K |f(x)|\, dx$. Consequently, $(L^1(K), \|\cdot\|_1)$ becomes a normed vectorial space.

Example 9.2.3 Let $\mathcal{S} = L^2(K)$ be the space of real-valued squared integrable functions on a compact set $K$. The distance function is

\[
d(f, g) = \Big( \int_K |f(x) - g(x)|^2\, dx \Big)^{1/2}, \quad \forall f, g \in L^2(K).
\]

The interpretation of this is the energy difference between two finite energy signals $f$ and $g$ over the set $K$. The corresponding norm is the $L^2$-norm defined by $\|f\|_2 = \big( \int_K |f(x)|^2\, dx \big)^{1/2}$, with respect to which $L^2(K)$ becomes a normed vector space.
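The three distances of Examples 9.2.1 to 9.2.3 can be compared on a concrete pair of functions. The following is a minimal sketch, assuming $K = [0,1]$ and a midpoint rule for the integrals; the functions $f(x) = x$ and $g(x) = x^2$ are illustrative choices.

```python
import math

def distances(f, g, n=10000):
    """Approximate the sup, L1 and L2 distances between f and g on K = [0, 1]."""
    h = 1.0 / n
    mids = [(i + 0.5) * h for i in range(n)]       # midpoint-rule nodes
    diffs = [abs(f(x) - g(x)) for x in mids]
    d_sup = max(diffs)                             # sup-distance on the nodes
    d_l1 = h * sum(diffs)                          # integral of |f - g|
    d_l2 = math.sqrt(h * sum(d * d for d in diffs))
    return d_sup, d_l1, d_l2

d_sup, d_l1, d_l2 = distances(lambda x: x, lambda x: x * x)
```

For this pair the exact values are $1/4$, $1/6$ and $1/\sqrt{30}$. Since $K$ has measure 1 here, the ordering $d_{L^1} \le d_{L^2} \le d_{\sup}$ holds, so closeness in the sup-distance is the most demanding of the three.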





Example 9.2.4 Let $K$ be a finite set in $\mathbb{R}^n$, with elements denoted by $x_j$, and consider $\mathcal{S} = \{f : K \to \mathbb{R}\}$, the set of real-valued functions defined on a finite set. In this case the distance can be either the sum of squared errors

\[
d_s(f, g) = \Big( \sum_{x_j \in K} |f(x_j) - g(x_j)|^2 \Big)^{1/2}
\]

or the absolute deviation error

\[
d_a(f, g) = \sum_{x_j \in K} |f(x_j) - g(x_j)|.
\]

These distances are commonly used in regression analysis. Also, in the case of a classical perceptron, the inputs are finite, with $K = \{0,1\}$. It is worth noting that $d_s(f, g)$ is the Euclidean distance between the points $f(x)$ and $g(x)$, whose coordinates are the values of $f$ and $g$ on $K$, while $d_a(f, g)$ is the taxi-cab distance between the same points (this is the distance measured along the coordinate curves, similar to the distance a cab would follow on a grid of perpendicular streets, like the ones in New York City).
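Both distances are straightforward to evaluate on a small finite input set. The sketch below is illustrative only: the set `K`, the target `f` (a logical AND), and the linear candidate `g` are hypothetical choices.

```python
import math

def d_s(f, g, K):
    """Square root of the sum of squared errors over the finite set K."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in K))

def d_a(f, g, K):
    """Absolute deviation error over the finite set K."""
    return sum(abs(f(x) - g(x)) for x in K)

K = [(0, 0), (0, 1), (1, 0), (1, 1)]          # hypothetical finite set in R^2
f = lambda x: float(x[0] and x[1])            # target: logical AND
g = lambda x: 0.5 * (x[0] + x[1]) - 0.25      # hypothetical linear candidate

euclid = d_s(f, g, K)      # every error has size 0.25, so this is 0.5
taxicab = d_a(f, g, K)     # four errors of size 0.25 add up to 1.0
```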

Example 9.2.5 In a very general setup, the target space can be considered to be the space of Borel-measurable functions from $\mathbb{R}^n$ to $\mathbb{R}$:

\[
\mathcal{M}^n = \{f : \mathbb{R}^n \to \mathbb{R};\ f \text{ Borel-measurable}\}.
\]

This means that if $\mathcal{B}(\mathbb{R})$ and $\mathcal{B}(\mathbb{R}^n)$ denote the Borel $\sigma$-fields on $\mathbb{R}$ and $\mathbb{R}^n$, respectively, then $f^{-1}(\mathcal{B}(\mathbb{R})) \subset \mathcal{B}(\mathbb{R}^n)$, or, equivalently

\[
f^{-1}([a, b]) \in \mathcal{B}(\mathbb{R}^n), \quad \forall a, b \in \mathbb{R}.
\]

In order to introduce a distance function on $\mathcal{M}^n$, we shall consider first a probability measure $\mu$ on the measurable space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, i.e., a measure $\mu : \mathcal{B}(\mathbb{R}^n) \to [0,1]$ with $\mu(\mathbb{R}^n) = 1$. Define

\[
d_\mu(f, g) = \inf\{\epsilon > 0;\ \mu(|f - g| > \epsilon) < \epsilon\}, \quad \forall f, g \in \mathcal{M}^n.
\]

Then $(\mathcal{M}^n, d_\mu)$ becomes a metric space. It is worth noting that the elements of $\mathcal{M}^n$ are determined almost everywhere, i.e., whenever $\mu\{x;\ f(x) = g(x)\} = 1$, the measurable functions $f$ and $g$ are identified.

It will be shown in section 9.3.4 that the convergence in the metric $d_\mu$ is equivalent to the convergence in probability.
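To get a feeling for $d_\mu$, one can evaluate it for an empirical measure that puts equal mass on finitely many sample points. The following is a minimal sketch under that assumption; the bisection works because the set of admissible $\epsilon$ is an upper interval, and the function name and sample data are hypothetical.

```python
def ky_fan_distance(f, g, samples, iters=60):
    """Estimate inf{eps > 0 : mu(|f - g| > eps) < eps}, where mu is the
    empirical measure giving mass 1/len(samples) to each sample point."""
    diffs = [abs(f(x) - g(x)) for x in samples]
    n = len(diffs)

    def exceed_fraction(eps):
        # mu(|f - g| > eps) under the empirical measure
        return sum(1 for d in diffs if d > eps) / n

    lo, hi = 0.0, 2.0          # any eps > 1 is admissible because mu <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if exceed_fraction(mid) < mid:
            hi = mid           # mid is admissible, move the upper end down
        else:
            lo = mid
    return hi

samples = list(range(10))                       # hypothetical sample points
f = lambda x: 0.5 if x == 0 else 0.0            # differs from g at one point
g = lambda x: 0.0
distance = ky_fan_distance(f, g, samples)       # close to 0.1
same = ky_fan_distance(g, g, samples)           # close to 0
```

In this example $f$ and $g$ differ by $0.5$ on a set of measure $0.1$, and the distance is $0.1$: a large discrepancy on a small set costs only the measure of that set, which is the behavior expected from convergence in probability.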


We shall start the study of the learning process with the simple case of a 
deep network that has only one hidden layer. 








9.3 Single Hidden Layer Networks 

This section shows that a neural network with one hidden layer can be used as a function approximator for the spaces $C(I_n)$, $L^1(I_n)$, and $L^2(I_n)$, where $I_n$ is the unit hypercube in $\mathbb{R}^n$. This means that given a function $f(x)$ in one of the aforementioned spaces, having the input variable $x \in I_n = [0,1]^n$, there is a tuning of the network weights such that the neural network output, $g(x)$, is “as close as possible” to the given function $f(x)$, in the sense of a certain distance.

9.3.1 Learning continuous functions $f \in C(I_n)$

Before the development of the main results, some preliminary results of functional analysis are needed. The first lemma states the existence of a separation functional, which vanishes on a subspace and has a nonzero value outside of the subspace. For the basic elements of functional analysis needed in this section, the reader is referred to section E in the Appendix.


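Lemma 9.3.1 Let $\mathcal{U}$ be a linear subspace of a normed linear space $\mathcal{X}$ and consider $x_0 \in \mathcal{X}$ such that

\[
\mathrm{dist}(x_0, \mathcal{U}) \ge \delta,
\]

for some $\delta > 0$, i.e., $\|x_0 - u\| \ge \delta$, $\forall u \in \mathcal{U}$. Then there is a bounded linear functional $L$ on $\mathcal{X}$ such that:

(i) $\|L\| \le 1$

(ii) $L(u) = 0$, $\forall u \in \mathcal{U}$, i.e., $L|_{\mathcal{U}} = 0$

(iii) $L(x_0) = \delta$.

Proof: Consider the linear space $\mathcal{T}$ generated by $\mathcal{U}$ and $x_0$

\[
\mathcal{T} = \{t;\ t = u + \lambda x_0,\ u \in \mathcal{U},\ \lambda \in \mathbb{R}\}
\]

and define the function $L : \mathcal{T} \to \mathbb{R}$ by

\[
L(t) = L(u + \lambda x_0) = \lambda \delta.
\]

Since for any $t_1, t_2 \in \mathcal{T}$ and $\alpha \in \mathbb{R}$ we have

\begin{align*}
L(t_1 + t_2) &= L(u_1 + u_2 + (\lambda_1 + \lambda_2) x_0) = (\lambda_1 + \lambda_2)\delta = L(t_1) + L(t_2) \\
L(\alpha t) &= \alpha L(t),
\end{align*}

it follows that $L$ is a linear functional on $\mathcal{T}$.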






Figure 9.1: The functional $L$ that vanishes on subspace $\mathcal{U}$, see Lemma 9.3.1.


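Next, we show that $L(t) \le \|t\|$, $\forall t \in \mathcal{T}$. First, notice that if $u \in \mathcal{U}$ and $\lambda \ne 0$, then also $-\frac{u}{\lambda} \in \mathcal{U}$ and, using the hypothesis, we have

\[
\Big\| x_0 + \frac{u}{\lambda} \Big\| = \Big\| x_0 - \Big( -\frac{u}{\lambda} \Big) \Big\| \ge \delta,
\]

or, equivalently, $\dfrac{\delta}{\| x_0 + \frac{u}{\lambda} \|} \le 1$. This is restated as $|\lambda| \delta \le \|u + \lambda x_0\|$. Then

\[
L(t) = L(u + \lambda x_0) = \lambda \delta \le |\lambda| \delta \le \|u + \lambda x_0\| = \|t\|.
\]

Hence, $L(t) \le \|t\|$, for all $t \in \mathcal{T}$.

Applying the Hahn-Banach Theorem, see Appendix, section E.3, with the norm $p(x) = \|x\|$, we obtain that $L$ can be extended to a linear functional on $\mathcal{X}$, denoted also by $L$, such that $L(x) \le \|x\|$, for all $x \in \mathcal{X}$, see Fig. 9.1. This implies $\|L\| \le 1$, so the extension $L$ is bounded. From the definition of $L$, we also have

\begin{align*}
L(u) &= L(u + 0 \cdot x_0) = 0 \cdot \delta = 0, \quad \forall u \in \mathcal{U} \\
L(x_0) &= L(0 + 1 \cdot x_0) = 1 \cdot \delta = \delta > 0,
\end{align*}

which proves the lemma. ■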

The previous result can be reformulated using the language of dense subspaces as follows. First, recall that a subspace $\mathcal{U}$ of $\mathcal{X}$ is dense in $\mathcal{X}$ with respect to the norm $\|\cdot\|$ if for any element $x \in \mathcal{X}$ there are elements $u \in \mathcal{U}$ as close as possible to $x$. Equivalently, $\forall x \in \mathcal{X}$ there is a sequence $(u_n)_n$ in $\mathcal{U}$ such that $u_n \to x$, as $n \to \infty$. This can be restated as: $\forall x \in \mathcal{X}$, $\forall \epsilon > 0$, there is $u \in \mathcal{U}$ such that $\|u - x\| < \epsilon$.

Consequently, the fact that the subspace $\mathcal{U}$ is not dense in $\mathcal{X}$ can be described as: there are elements $x_0 \in \mathcal{X}$ such that no elements $u \in \mathcal{U}$ are close enough to $x_0$; or: there is a $\delta > 0$ such that $\forall u \in \mathcal{U}$ we have $\|u - x_0\| \ge \delta$. This is just the hypothesis condition of Lemma 9.3.1. Thus, we have the following reformulation:


Lemma 9.3.2 (reformulation of Lemma 9.3.1) Let $\mathcal{U}$ be a linear, non-dense subspace of a normed linear space $\mathcal{X}$. Then there is a bounded linear functional $L$ on $\mathcal{X}$ such that $L \ne 0$ on $\mathcal{X}$ and $L|_{\mathcal{U}} = 0$.

Denote by $C(I_n)$ the linear space of continuous functions on the hypercube $I_n = [0,1]^n$, as a normed space with the norm

\[
\|f\| = \sup_{x \in I_n} |f(x)|, \quad \forall f \in C(I_n).
\]


Let $M(I_n)$ denote the space of finite signed Baire measures on $I_n$ (see sections C.3 and C.9 of the Appendix for definitions). The reason for using Baire measures consists in their compatibility with compactly supported functions (which are Baire-measurable and integrable with respect to any finite Baire measure). In general Baire sets and Borel sets need not be the same; however, in the case of compact metric spaces, like $I_n$, the Borel sets and the Baire sets are the same, so Baire measures are the same as Borel measures that are finite on compact sets. These remarks will be useful when applying the representation theorem, Theorem E.5.6 of the Appendix, in the proof of the next result.

The next result states that there is always a signed measure that vanishes on a non-dense subspace of continuous functions.

Lemma 9.3.3 Let $\mathcal{U}$ be a linear, non-dense subspace of $C(I_n)$. Then there is a measure $\mu \in M(I_n)$ such that

\[
\int_{I_n} h\, d\mu = 0, \quad \forall h \in \mathcal{U}.
\]


Proof: Considering $\mathcal{X} = C(I_n)$ in Lemma 9.3.2, there is a bounded linear functional $L : C(I_n) \to \mathbb{R}$ such that $L \ne 0$ on $C(I_n)$ and $L|_{\mathcal{U}} = 0$. Applying the representation theorem of linear bounded functionals on $C(I_n)$, see Theorem E.5.6 from the Appendix, there is a measure $\mu \in M(I_n)$ such that

\[
L(f) = \int_{I_n} f\, d\mu, \quad \forall f \in C(I_n).
\]

In particular, for any $h \in \mathcal{U}$, we obtain

\[
L(h) = \int_{I_n} h\, d\mu = 0,
\]

which is the desired result. ■

Remark 9.3.4 It is worth noting that $L \ne 0$ implies $\mu \ne 0$.

The next result uses the concept of discriminatory function introduced in Chapter 2, Definition 2.2.2.

Proposition 9.3.5 Let $\sigma$ be any continuous discriminatory function. Then the finite sums of the form

\[
G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \quad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}
\]

are dense in $C(I_n)$.

Proof: Since $\sigma$ is continuous, it follows that

\[
\mathcal{U} = \Big\{ G;\ G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j) \Big\}
\]

is a linear subspace of $C(I_n)$. We continue the proof by contradiction.

Assume that $\mathcal{U}$ is not dense in $C(I_n)$. By Lemma 9.3.3 there is a measure $\mu \in M(I_n)$ such that

\[
\int_{I_n} h\, d\mu = 0, \quad \forall h \in \mathcal{U}.
\]

This can also be written as

\[
\sum_{j=1}^{N} \alpha_j \int_{I_n} \sigma(w_j^T x + \theta_j)\, d\mu = 0, \quad \forall w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}.
\]

By choosing convenient coefficients $\alpha_j$, we obtain

\[
\int_{I_n} \sigma(w^T x + \theta)\, d\mu = 0, \quad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R}.
\]

Using that $\sigma$ is discriminatory yields $\mu = 0$, which is a contradiction, see Remark 9.3.4. ■





Figure 9.2: The one-hidden layer neural network which approximates continuous functions.


The previous result states that $\forall f \in C(I_n)$ and $\forall \epsilon > 0$, there is a sum $G(x)$ of the previous form such that

\[
|G(x) - f(x)| < \epsilon, \quad \forall x \in I_n.
\]

Equivalently, the one-hidden layer neural network shown in Fig. 9.2 can learn any continuous function $f \in C(I_n)$ within an $\epsilon$-error, by tuning its weights.

The next result states that, given no restrictions on the number of nodes and size of the weights, a one-hidden layer neural network is a continuous function approximator, see Fig. 9.2. Here the input is the vector $x^T = (x_1, \dots, x_n) \in I_n$ and the weights from the input to the hidden layer are denoted by $(w_1, \dots, w_N)$. Each weight is a vector given by $w_j^T = (w_{j1}, \dots, w_{jn})$. The weights from the hidden layer to the output are $(\alpha_1, \dots, \alpha_N)$, with each $\alpha_j$ a real number. The number of neurons in the hidden layer is $N$, and their biases are $(\theta_1, \dots, \theta_N)$. Note that in the outer layer the activation function is linear, $\varphi(x) = x$, while in the hidden layer it is sigmoidal.

Theorem 9.3.6 (Cybenko, 1989) Let $\sigma$ be an arbitrary continuous sigmoidal function in the sense of Definition 2.2.1. Then the finite sums of the form

\[
G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \quad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}
\]

are dense in $C(I_n)$.

Proof: By Proposition 2.2.4, any continuous sigmoidal function is discriminatory. Then applying Proposition 9.3.5 yields the desired result. ■
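The theorem is an existence statement, and its proof gives no recipe for the weights. Purely as an illustration in dimension $n = 1$, the sketch below builds a sum of the above form by hand, using steep logistic units to imitate a staircase through samples of the target (a construction in the spirit of the one-dimensional results of Chapter 8, not the Hahn-Banach argument). The target $\cos(3x)$, the `steepness` factor, and the extra constant unit are assumptions of this sketch.

```python
import math

def sigmoid(z):
    # Logistic function, evaluated without overflow for large |z|
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_sigmoid_sum(f, N, steepness=50.0):
    """Weights (alpha, w, theta) of G(x) = sum_j alpha_j * sigma(w_j x + theta_j)."""
    xs = [j / N for j in range(N + 1)]
    # Unit 0 has w = 0 and a large bias, so sigma is ~1 and the term is ~f(0).
    alpha, w, theta = [f(xs[0])], [0.0], [40.0]
    for j in range(1, N + 1):
        mid = 0.5 * (xs[j - 1] + xs[j])
        alpha.append(f(xs[j]) - f(xs[j - 1]))   # jump of the staircase
        w.append(steepness * N)                  # steep unit ~ Heaviside step
        theta.append(-steepness * N * mid)       # step located at the midpoint
    return alpha, w, theta

def G(x, alpha, w, theta):
    return sum(a * sigmoid(wj * x + t) for a, wj, t in zip(alpha, w, theta))

f = lambda x: math.cos(3.0 * x)                  # hypothetical target on [0, 1]
errors = {}
for N in (10, 100):
    params = build_sigmoid_sum(f, N)
    errors[N] = max(abs(G(i / 500, *params) - f(i / 500)) for i in range(501))
```

The sup-error recorded in `errors` decreases as the number of hidden units grows, in agreement with the density statement; a trained network would typically reach a comparable error with far fewer units.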











Remark 9.3.7 If the activation function is the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$, then all functions $G(x)$ are analytic in $x$, as finite sums of analytic functions. Since there are continuous functions in $C(I_n)$ that are not analytic, it follows that the approximation space $\mathcal{U}$ is a proper subspace of $C(I_n)$ (namely, $\mathcal{U} \ne C(I_n)$).

The next result is valid for any continuous activation function, see [59]. The price paid for this is to switch from the $\Sigma$-class networks, which approximate target functions using sums, to $\Sigma\Pi$-class networks, which involve sums of products.

Theorem 9.3.8 (Hornik, Stinchcombe, White, 1989) Let $\varphi : \mathbb{R} \to \mathbb{R}$ be any continuous nonconstant activation function. Then the finite sums of finite products of the form

\[
G(x) = \sum_{k=1}^{M} \beta_k \prod_{j=1}^{N_k} \varphi(w_{jk}^T x + \theta_{jk}), \tag{9.3.1}
\]

with $w_{jk} \in \mathbb{R}^n$, $\beta_k, \theta_{jk} \in \mathbb{R}$, $x \in I_n$, $M, N_k = 1, 2, \dots$, are dense in $C(I_n)$.

Proof: Consider the set $\mathcal{U}$ given by the finite sums of products of the type (9.3.1). Since $\varphi$ is continuous, then for any $G, G_1, G_2 \in \mathcal{U}$, we obviously have

\[
G_1 + G_2 \in \mathcal{U}, \quad G_1 G_2 \in \mathcal{U}, \quad \lambda G \in \mathcal{U}, \quad \forall \lambda \in \mathbb{R},
\]

which shows that $\mathcal{U}$ is an algebra of real continuous functions on $I_n$. Next we shall verify the conditions in the hypothesis of the Stone-Weierstrass theorem, see Theorem 7.4.4:

• $\mathcal{U}$ separates points of $I_n$. Let $x, y \in I_n$, $x \ne y$. We need to show that there is $G \in \mathcal{U}$ such that $G(x) \ne G(y)$.

Since $\varphi$ is nonconstant, there are $a, b \in \mathbb{R}$, $a \ne b$, with $\varphi(a) \ne \varphi(b)$. Choose $w$ and $\theta$ such that the points $x$ and $y$ lie in the hyperplanes $\{w^T u + \theta = a\}$ and $\{w^T u + \theta = b\}$, respectively, which is possible since $x \ne y$. Then $G(u) = \varphi(w^T u + \theta)$ separates $x$ and $y$:

\begin{align*}
G(x) &= \varphi(w^T x + \theta) = \varphi(a) \\
G(y) &= \varphi(w^T y + \theta) = \varphi(b) \ne \varphi(a).
\end{align*}

• $\mathcal{U}$ contains nonzero constants. Let $\theta$ be such that $\varphi(\theta) \ne 0$, and choose the vector $w^T = (0, \dots, 0) \in \mathbb{R}^n$. Then $G(x) = \varphi(w^T x + \theta) = \varphi(\theta) \ne 0$ is a nonzero constant. Multiplying by any nonzero real number $\lambda$ we obtain that all nonzero constants are contained in $\mathcal{U}$.

Applying now the Stone-Weierstrass Theorem, it follows that $\mathcal{U}$ is dense in $C(I_n)$, which ends the proof of the theorem. ■





Remark 9.3.9 1. Since the Stone-Weierstrass Theorem holds in general on compact spaces, replacing the input space $I_n$ by any compact set $K$ in $\mathbb{R}^n$ (i.e., a bounded, closed set) does not affect the conclusion of the theorem.

2. This approach is shorter than the one using the Hahn-Banach Theorem used in the proof of Theorem 9.3.6, but it cannot be extended to $\Sigma$-class approximators.


9.3.2 Learning square integrable functions / £ L 2 (/ n ) 

It makes sense to make the network learn signals / of finite energy on / n , 
i.e., functions for which 


/ f{x ) 2 dx < oo. 

j i n 


where dx denotes the Lebesgue measure on / n . The space of functions with 

this property is denoted by L 2 (I n ). This is a normed linear space with the 

/ \ 1/2 

norm \\fW 2 = f fj f(x) 2 dx) . The interaction between two signals f,g £ 

L 2 (I n ) is measured by their scalar product (f,g) = fj f(x)g(x) dx. Two 
signals /, g, which do not interact have a zero scalar product, (f,g) — 0, case 
in which they are called orthogonal ; in this case we adopt the notation / _L g. 

Consider an activation function $\sigma$ with the property $0 \le \sigma \le 1$. Then $h(x) = \sigma(w^T x + \theta) \in L^2(I_n)$, for any $w \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$, since

$$\int_{I_n} h^2(x) \, dx = \int_{I_n} \sigma(w^T x + \theta)^2 \, dx \le \int_{I_n} dx = 1.$$


In the following we shall carry over the approximation theory from $C(I_n)$ as much as possible. Consider all finite sums of the type

$$\mathcal{U} = \Big\{ G;\ G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j),\ w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R},\ N = 0, 1, 2, \dots \Big\}$$

and notice that $\mathcal{U}$ is a linear subspace of $L^2(I_n)$. We are looking for a property of $\sigma$ such that $\mathcal{U}$ becomes dense in $L^2(I_n)$.

The following discussion supports the next definition, Definition 9.3.10. We start by assuming that $\mathcal{U}$ is dense in $L^2(I_n)$, with respect to the $L^2$-norm. Consider $g \in L^2(I_n)$ such that $g \perp \mathcal{U}$ (i.e., $g$ is orthogonal to all elements of the space $\mathcal{U}$). Then there is a sequence $(G_n)_n$ in $\mathcal{U}$ with $G_n \to g$ in $L^2$, as $n \to \infty$. Since $g \perp G_n$, then $\langle g, G_n \rangle = 0$, and using the continuity of the scalar product, taking the limit yields $\langle g, g \rangle = \|g\|_2^2 = 0$. Hence $g = 0$ almost everywhere.




The relation $g \perp \mathcal{U}$ can be written equivalently as

$$\int_{I_n} \sigma(w^T x + \theta) g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R}. \qquad (9.3.2)$$

We have shown that this relation implies $g = 0$ a.e. The conclusion of this discussion is the following: if a finite energy signal, $g$, does not interact with any of the neural outputs, $\sigma(w^T x + \theta)$ (i.e., relation (9.3.2) holds), then $g$ must be an almost everywhere zero signal. We shall use this as our desired property for the activation function $\sigma$.


Definition 9.3.10 The activation function $\sigma$ is called discriminatory in $L^2$-sense if:

(i) $0 \le \sigma \le 1$;

(ii) if $g \in L^2(I_n)$ is such that

$$\int_{I_n} \sigma(w^T x + \theta) g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R},$$

then $g = 0$.

We shall provide next the analogous $L^2$-version of Proposition 9.3.5.

Proposition 9.3.11 Let $\sigma$ be a discriminatory function in $L^2$-sense. Then the finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \qquad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}$$

are dense in $L^2(I_n)$.

Proof: By contradiction, assume the previously defined linear subspace $\mathcal{U}$ is not dense in $L^2(I_n)$. From Lemma 9.3.2 there is a bounded linear functional $F$ on $L^2(I_n)$ such that $F \neq 0$ on $L^2(I_n)$ and $F|_{\mathcal{U}} = 0$. By the Riesz representation theorem, see Theorem E.5.1 in the Appendix, there is $g \in L^2(I_n)$ such that

$$F(f) = \int_{I_n} f(x) g(x) \, dx,$$

with $\|F\| = \|g\|_2$. The condition $F|_{\mathcal{U}} = 0$ implies

$$\int_{I_n} G(x) g(x) \, dx = 0, \qquad \forall G \in \mathcal{U}.$$







Figure 9.3: The upper half-space for: a. the unit square $I_2$. b. the unit cube $I_3$.


In particular,

$$\int_{I_n} \sigma(w^T x + \theta) g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R}.$$

Using that $\sigma$ is discriminatory in $L^2$-sense, it follows that $g = 0$. This triggers $\|F\| = \|g\|_2 = 0$, which implies that $F = 0$, a contradiction.

It follows that $\mathcal{U}$ is dense in $L^2(I_n)$. ■

The meaning of the previous proposition is: $\forall h \in L^2(I_n)$, $\forall \epsilon > 0$, there is $G$ of the previous form such that

$$\int_{I_n} |G(x) - h(x)|^2 \, dx < \epsilon.$$

Or, equivalently, any $L^2$-function with the input in the hypercube $I_n$ can be learned by a one-hidden layer network within a mean square error as small as we desire.
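The statement above is an existence result and does not say how the approximating sum $G$ is found. As an illustration only, the following sketch fixes random hidden weights $w_j, \theta_j$ for logistic neurons on $[0, 1]$ and fits just the output coefficients $\alpha_j$ by least squares. The target function, the weight distribution and the names `features` and `mse_with` are assumptions of ours, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))

def target(x):
    """An illustrative finite energy target on [0, 1], with a jump at 0.5."""
    return np.sin(2 * np.pi * x) + 0.5 * (x > 0.5)

# Quadrature nodes on [0, 1] and a pool of random hidden neurons.
x = (np.arange(2000) + 0.5) / 2000
N_max = 32
w = rng.normal(0.0, 10.0, size=N_max)
theta = -w * rng.uniform(0.0, 1.0, size=N_max)   # transition point of each neuron lies in [0, 1]
features = sigma(np.outer(x, w) + theta)         # shape (2000, N_max)

def mse_with(N):
    """Mean square error of the best output layer using the first N hidden neurons."""
    Phi = features[:, :N]
    alpha, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
    return float(np.mean((Phi @ alpha - target(x)) ** 2))

errors = {N: mse_with(N) for N in (2, 8, 32)}
print(errors)
```

Since the hidden neurons are nested, the least squares error cannot increase as $N$ grows; the proposition guarantees more, namely that with suitably chosen neurons it can be made arbitrarily small.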


The only concern left is whether the usual activation functions introduced in Chapter 2 are discriminatory in $L^2$-sense. We shall deal with this issue in the following. First, we develop a result, which is an $L^2$-analog of Lemma 2.2.3.


Lemma 9.3.12 Let $g \in L^2(I_n)$ be such that $\int_{\mathcal{H}_{w,\theta}} g(x) \, dx = 0$, for any half-space $\mathcal{H}_{w,\theta} = \{x;\ w^T x + \theta > 0\} \cap I_n$. Then $g = 0$ almost everywhere.

Proof: Let $g \in L^2(I_n)$ and $w \in \mathbb{R}^n$ be fixed and consider the linear functional $F : L^\infty(\mathbb{R}) \to \mathbb{R}$ defined by

$$F(h) = \int_{I_n} h(w^T x) g(x) \, dx,$$

with $h$ bounded on $I_n$. Since

$$|F(h)| = \Big| \int_{I_n} h(w^T x) g(x) \, dx \Big| \le \int_{I_n} |g(x)| \, dx \ \|h\|_\infty \le \|g\|_2 \|h\|_\infty,$$

it follows that $F$ is bounded, with $\|F\| \le \|g\|_2$.

Evaluating $F$ on the indicator function $h(u) = 1_{(\theta, \infty)}(u)$, and using the hypothesis, yields

$$F(h) = \int_{I_n} 1_{(\theta, \infty)}(w^T x) g(x) \, dx = \int_{\mathcal{H}_{w,-\theta}} g(x) \, dx = 0.$$

Similarly, $F$ vanishes on $h(u) = 1_{[\theta, \infty)}(u)$, since the hyperplane $\{w^T x - \theta = 0\}$ has zero measure. By linearity, $F$ vanishes on combinations of indicator functions, such as $1_A$, with $A$ an interval, and then vanishes on finite sums of these types of functions, i.e., on simple functions. Namely, if $s = \sum_{j=1}^{N} \alpha_j 1_{A_j}$, with $\{A_j\}_j$ disjoint intervals in $\mathbb{R}$, then $F(s) = 0$. Using that simple functions are dense in $L^\infty(\mathbb{R})$, then

$$F(h) = 0, \qquad \forall h \in L^\infty(\mathbb{R}).$$


This relation holds in particular for the bounded functions $s(u) = \sin u$ and $c(u) = \cos u$, i.e., $F(s) = 0$ and $F(c) = 0$. This way we are able to compute the Fourier transform of $g$:

$$\hat{g}(w) = \int_{I_n} e^{i w^T x} g(x) \, dx = \int_{I_n} \cos(w^T x) g(x) \, dx + i \int_{I_n} \sin(w^T x) g(x) \, dx = F(c) + i F(s) = 0.$$

Since the Fourier transform is one-to-one, it follows that $g = 0$ almost everywhere. (For the definition of the Fourier transform, see Appendix, Section D.6.4.)


Remark 9.3.13 It is worth noting that by choosing a convenient value for the parameter $\theta$, the upper half-space, see Fig. 9.3, may become the entire hypercube. Then the function $g$ considered before has a zero integral,

$$\int_{I_n} g(x) \, dx = 0.$$

Now we shall provide examples of familiar activation functions, which are discriminatory in $L^2$-sense. The one-hidden layer neural networks having this type of activation functions can learn finite energy signals.












Example 9.3.14 The step function

$$\varphi(x) = H(x) = \begin{cases} 1, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases}$$

is discriminatory in $L^2$-sense.

The condition $0 \le \varphi \le 1$ is obvious. In order to show the rest, we shall use Lemma 9.3.12. Let $g \in L^2(I_n)$ and assume

$$\int_{I_n} \varphi(w^T x + \theta) g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R},$$

which becomes

$$\int_{\mathcal{H}_{w,\theta}} g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R},$$

which implies $g = 0$, a.e., by Lemma 9.3.12. Hence, the aforementioned step function, $H(x)$, is discriminatory in $L^2$-sense.

Consequently, Proposition 9.3.11 implies that any finite energy function $g \in L^2(I_n)$ can be approximated by the output of a one-hidden layer perceptron, with enough perceptrons in the hidden layer.


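Example 9.3.15 The logistic function $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ is discriminatory in $L^2$-sense.

In order to show this, we set $g \in L^2(I_n)$ and assume

$$\int_{I_n} \sigma(w^T x + \theta) g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R}.$$

Let $\sigma_\lambda(x) = \sigma(\lambda(w^T x + \theta))$. Then also

$$\int_{I_n} \sigma_\lambda(x) g(x) \, dx = 0, \qquad \forall \lambda \in \mathbb{R},$$

or, equivalently, $\langle \sigma_\lambda, g \rangle = 0$. Notice that $\lim_{\lambda \to \infty} \sigma_\lambda = \gamma$ pointwise, where

$$\gamma(x) = \begin{cases} 1, & \text{if } w^T x + \theta > 0 \\ 0, & \text{if } w^T x + \theta < 0 \\ \tfrac{1}{2}, & \text{if } w^T x + \theta = 0. \end{cases}$$

Assume for the time being that $\lim_{\lambda \to \infty} \sigma_\lambda = \gamma$ in $L^2$-sense. By the continuity of the scalar product we have

$$0 = \lim_{\lambda \to \infty} \langle \sigma_\lambda, g \rangle = \langle \gamma, g \rangle = \int_{I_n} \gamma(x) g(x) \, dx = \int_{\mathcal{H}_{w,\theta}} g(x) \, dx,$$

where the last equality uses that the hyperplane $\{w^T x + \theta = 0\}$ has zero measure. Since this relation holds for any $w$ and $\theta$, in virtue of Lemma 9.3.12 it follows that $g = 0$ a.e., which proves that $\sigma$ is discriminatory in $L^2$-sense.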

We are still left to show that $\sigma_\lambda \to \gamma$ in $L^2$-sense, as $\lambda \to \infty$, which is equivalent to

$$\lim_{\lambda \to \infty} \int_{I_n} |\sigma_\lambda(x) - \gamma(x)|^2 \, dx = 0.$$

Computing the difference piecewise, the integrand becomes by straightforward computation

$$|\sigma_\lambda(x) - \gamma(x)|^2 = \begin{cases} \dfrac{1}{\big(1 + e^{\lambda(w^T x + \theta)}\big)^2}, & \text{if } w^T x + \theta > 0 \\[2mm] \dfrac{1}{\big(1 + e^{-\lambda(w^T x + \theta)}\big)^2}, & \text{if } w^T x + \theta < 0 \\[2mm] 0, & \text{if } w^T x + \theta = 0. \end{cases}$$

Then

$$\lim_{\lambda \to \infty} \int_{I_n} |\sigma_\lambda(x) - \gamma(x)|^2 \, dx = \lim_{\lambda \to \infty} \int_{\mathcal{H}_{w,\theta}} \frac{dx}{\big(1 + e^{\lambda(w^T x + \theta)}\big)^2} + \lim_{\lambda \to \infty} \int_{\mathcal{H}_{-w,-\theta}} \frac{dx}{\big(1 + e^{-\lambda(w^T x + \theta)}\big)^2} = 0,$$

since the last two limits vanish as an application of the Dominated Convergence Theorem (Theorem C.4.2 in Appendix). Therefore, the logistic function is discriminatory in the $L^2$-sense.

Consequently, Proposition 9.3.11 implies that any finite energy function $g \in L^2(I_n)$ can be approximated by the output of a one-hidden layer neural network, with enough logistic sigmoid neurons in the hidden layer.
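The $L^2$-convergence of the steepened logistic functions $\sigma_\lambda$ to the step-like limit $\gamma$ used in the example above can be observed numerically. The sketch below works in one dimension on $[0, 1]$ with the illustrative choice $w = 1$, $\theta = -0.5$ (our assumption) and a midpoint rule; the function name `l2_gap` is hypothetical.

```python
import numpy as np

def l2_gap(lam, w=1.0, theta=-0.5, m=10_000):
    """Midpoint-rule value of the integral over [0, 1] of |sigma_lam - gamma|^2."""
    x = (np.arange(m) + 0.5) / m
    z = w * x + theta
    sigma_lam = 1.0 / (1.0 + np.exp(-lam * z))
    # gamma is 1 on the upper half-space, 0 below it, 1/2 on the hyperplane
    gamma = np.where(z > 0, 1.0, np.where(z < 0, 0.0, 0.5))
    return float(np.mean((sigma_lam - gamma) ** 2))

lams = [1, 10, 100, 1000]
gaps = [l2_gap(lam) for lam in lams]
for lam, gap in zip(lams, gaps):
    print(lam, gap)
```

The computed gaps decrease as $\lambda$ grows, in agreement with the limit obtained from the Dominated Convergence Theorem.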


9.3.3 Learning integrable functions $f \in L^1(I_n)$

The previous theory can be carried over to approximating integrable functions $f \in L^1(I_n)$. The discriminatory property for the sigmoidal in this case is given by:

Definition 9.3.16 The activation function $\sigma$ is called discriminatory in $L^1$-sense if:

(i) $\sigma$ is measurable and bounded;

(ii) $\sigma$ is sigmoidal in the sense of Definition 2.2.1;

(iii) if $g \in L^\infty(I_n)$ is such that

$$\int_{I_n} \sigma(w^T x + \theta) g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R},$$

then $g = 0$.

The $L^1$-version of Proposition 9.3.5 takes the following form:

Proposition 9.3.17 Let $\sigma$ be a discriminatory function in $L^1$-sense. Then the finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \qquad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}$$

are dense in $L^1(I_n)$.

Proof: The proof is similar to the one given for Proposition 9.3.5. We assume that the linear subspace of $L^1(I_n)$

$$\mathcal{U} = \Big\{ G;\ G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j),\ w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R},\ N = 0, 1, 2, \dots \Big\}$$

is not dense in $L^1(I_n)$. Applying Lemma 9.3.2, there is a bounded linear functional $F : L^1(I_n) \to \mathbb{R}$ such that $F \neq 0$ on $L^1(I_n)$ and $F|_{\mathcal{U}} = 0$. Using that $L^\infty(I_n)$ is the dual of $L^1(I_n)$, the Riesz theorem (Theorem E.5.2 in Appendix) supplies the existence of $g \in L^\infty(I_n)$ such that

$$F(f) = \int_{I_n} f(x) g(x) \, dx,$$

with $\|F\| = \|g\|_\infty \neq 0$. Now, the condition $F|_{\mathcal{U}} = 0$ writes as

$$\int_{I_n} G(x) g(x) \, dx = 0, \qquad \forall G \in \mathcal{U},$$

which implies

$$\int_{I_n} \sigma(w^T x + \theta) g(x) \, dx = 0, \qquad \forall w \in \mathbb{R}^n,\ \theta \in \mathbb{R}.$$

Since $\sigma$ is discriminatory in $L^1$-sense, we obtain $g = 0$, which contradicts $\|F\| \neq 0$. It follows that $\mathcal{U}$ is dense in $L^1(I_n)$. ■

Since for any $g \in L^\infty(I_n)$ the measure $\mu(x) = g(x) \, dx$ belongs to the space of signed measures $M(I_n)$, using Remark 2.2.5 of Proposition 2.2.4 it follows that any measurable, bounded sigmoidal function is discriminatory in the $L^1$-sense. As a consequence of Proposition 9.3.17, we have:










Theorem 9.3.18 Let $\sigma$ be any measurable, bounded sigmoidal function (in the sense of Definition 2.2.1). Then the finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \qquad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}$$

are dense in $L^1(I_n)$. This means that $\forall f \in L^1(I_n)$ and $\forall \epsilon > 0$, there is a function $G$ of the previous form such that

$$\|G - f\|_{L^1} = \int_{I_n} |G(x) - f(x)| \, dx < \epsilon.$$


Remark 9.3.19 1. In particular, a one-hidden layer neural network with logistic activation function for the hidden layer and a linear neuron in the output layer can learn functions in $L^1(I_n)$.

2. Using the Wiener Tauberian Theorem and properties of the Fourier transform, it can be shown that any function $\sigma$ in $L^1(\mathbb{R})$ with a nonzero integral, $\int_{\mathbb{R}} \sigma(x) \, dx \neq 0$, is discriminatory in $L^1$-sense.

3. Consequently, the conclusion of Theorem 9.3.18 is still valid if the activation function $\sigma$ is in $L^1(\mathbb{R})$ and has a nonzero integral.


9.3.4 Learning measurable functions $f \in \mathcal{M}(\mathbb{R}^n)$

A function $y = f(x)$ is measurable if it is the result of an observation that can be determined by a given body of information. This is usually the information that characterizes intervals in $\mathbb{R}$, squares in $\mathbb{R}^2$, cubes in $\mathbb{R}^3$, and so on; it is customarily denoted by $\mathcal{B}(\mathbb{R}^n)$ and is called the Borel $\sigma$-field on $\mathbb{R}^n$. Consequently, all inputs $x$, which are mapped by $f$ into an interval $[a - \epsilon, a + \epsilon]$, are known given the information $\mathcal{B}(\mathbb{R}^n)$. This fact can be written equivalently using the pre-image notation as

$$f^{-1}([a - \epsilon, a + \epsilon]) \in \mathcal{B}(\mathbb{R}^n), \qquad \forall a \in \mathbb{R},\ \epsilon > 0.$$

A measurable function is not necessarily continuous; nevertheless, a continuous function is always Borel-measurable. The reader can find more details about measurable functions in the Appendix.

In this section the target space is the space of Borel-measurable functions from $\mathbb{R}^n$ to $\mathbb{R}$, denoted by $\mathcal{M}^n = \mathcal{M}(\mathbb{R}^n)$. Learning a measurable function will be done "almost everywhere" with respect to a certain measure, which will be introduced next.

Let $\mu$ be a probability measure on the measurable space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$. This can be regarded as the input pattern of the input variable $x$. It is natural not to distinguish between two measurements (measurable functions $f$ and $g$) that differ on a negligible subset of the input pattern. (For instance, looking at a circle which is missing a few points, we can still infer that it is a circle.) Having this in mind, we say that two measurable functions, $f$ and $g$, are $\mu$-equivalent if

$$\mu\{x \in \mathbb{R}^n;\ f(x) \neq g(x)\} = 0.$$

Sometimes, we write $f = g$ a.e. (almost everywhere). For the sake of simplicity, identifying equivalent functions, the space of equivalence classes will be denoted also by $\mathcal{M}^n$. We make the remark that all measurable functions in this section are considered to be finite almost everywhere, i.e., $\mu\{x;\ |f(x)| = \infty\} = 0$.

The next metric will measure the distance between equivalence classes of measurable functions. More precisely, $f$ and $g$ are situated in a proximity if and only if there is a small probability that they differ significantly. The corresponding distance is defined by

$$d_\mu(f, g) = \inf\{\epsilon > 0;\ \mu(|f - g| > \epsilon) < \epsilon\}, \qquad \forall f, g \in \mathcal{M}^n. \qquad (9.3.3)$$

The previous assumptions that $f$ and $g$ are finite a.e. make sense here, because the difference $f - g$ might be infinite only on a negligible set. We note that $d_\mu(f, g) = 0$ is equivalent to $f = g$ a.e.; the symmetry of $d_\mu$ is obvious and the triangle inequality is left as an exercise to the reader. Therefore, $(\mathcal{M}^n, d_\mu)$ becomes a metric space.
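To get a feeling for definition (9.3.3), the distance can be estimated from samples of the difference $f(x) - g(x)$, with $x$ drawn from $\mu$. The following is a minimal sketch, assuming the empirical measure of the sample stands in for $\mu$; the function names `d_mu` and `rho_mu` and the bisection search are our own choices. The second function computes the truncated mean $E[|f - g| \wedge 1]$, where $a \wedge b = \min(a, b)$.

```python
import numpy as np

def d_mu(diff, iters=60):
    """Empirical version of inf{eps > 0 : mu(|f - g| > eps) < eps}.

    The condition mu(|f - g| > eps) < eps, once true, stays true for larger eps,
    so the infimum can be located by bisection.
    """
    a = np.abs(np.asarray(diff, dtype=float))
    lo, hi = 0.0, 1.0 + 1e-9   # the condition holds at hi because mu(...) <= 1 < hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.mean(a > mid) < mid:
            hi = mid
        else:
            lo = mid
    return hi

def rho_mu(diff):
    """Empirical version of E[|f - g| ^ 1], the mean of min(|f - g|, 1)."""
    return float(np.mean(np.minimum(np.abs(diff), 1.0)))

d_equal = d_mu(np.zeros(1000))        # f = g everywhere
d_shift = d_mu(np.full(1000, 0.3))    # f - g = 0.3 everywhere
d_far = d_mu(np.full(1000, 5.0))      # large difference everywhere: distance saturates at 1
rare = np.zeros(1000)
rare[0] = 10.0                        # large difference, but only with probability 1/1000
d_rare = d_mu(rare)
print(d_equal, d_shift, d_far, d_rare, rho_mu(rare))
```

The last case shows the point of the definition: two functions that differ a lot, but only on a set of small probability, are close in the $d_\mu$-metric.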

The next goal is to find a $d_\mu$-dense subset, which can be represented by outputs of one-hidden layer neural networks. This will be done in a few stages. We shall start with a few equivalent convergence properties of the metric $d_\mu$ needed later in the density characterization. Recall the notation for the minimum of two numbers, $x \wedge y = \min(x, y)$.


Lemma 9.3.20 Let $f_j, f \in \mathcal{M}^n$. The following statements are equivalent:

(a) $d_\mu(f_j, f) \to 0$, $j \to \infty$;

(b) For any $\epsilon > 0$, $\mu\{x;\ |f_j(x) - f(x)| > \epsilon\} \to 0$, $j \to \infty$;

(c) $\int_{\mathbb{R}^n} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) \to 0$, $j \to \infty$.


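Proof: We shall show that (a) $\Leftrightarrow$ (b) $\Leftrightarrow$ (c).

(a) $\Rightarrow$ (b) Assume that (a) holds. From the definition of $d_\mu$, for any $\epsilon > 0$ that satisfies $\mu(|f_j - f| > \epsilon) < \epsilon$, we have $d_\mu(f_j, f) \le \epsilon$. Since $d_\mu(f_j, f) \to 0$, we may choose a sequence $\epsilon = \epsilon_j \to 0$, such that

$$\mu(|f_j - f| > \epsilon_j) < \epsilon_j. \qquad (9.3.4)$$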







Figure 9.4: Graphical interpretation for two inequalities: a. $\epsilon 1_{(\epsilon, +\infty)}(x) \le x \wedge 1$. b. $x \wedge 1 \le \epsilon + 1_{(\epsilon, +\infty)}(x)$.


By contradiction, assume that (b) does not hold, so there is an $\epsilon_0 > 0$ such that

$$\mu\{|f_j - f| > \epsilon\} > \epsilon_0.$$

Choosing $\epsilon = \epsilon_j$ we arrive at a relation that contradicts (9.3.4).

(b) $\Rightarrow$ (a) If (b) holds, then obviously (a) holds.

(b) $\Leftrightarrow$ (c). Let $\epsilon \in (0, 1)$ be fixed. The following double inequality can be easily inferred from Fig. 9.4 a, b:

$$\epsilon 1_{(\epsilon, +\infty)}(x) \le x \wedge 1 \le \epsilon + 1_{(\epsilon, +\infty)}(x), \qquad x \ge 0. \qquad (9.3.5)$$

Replacing in (9.3.5) the variable $x$ by $|f_j(x) - f(x)|$, integrating with respect to $\mu$ and using that

$$\int_{\mathbb{R}^n} 1_{\{|f_j(x) - f(x)| > \epsilon\}} \, d\mu(x) = \mu\{x;\ |f_j(x) - f(x)| > \epsilon\},$$

we obtain

$$\epsilon \mu\{x;\ |f_j(x) - f(x)| > \epsilon\} \le \int_{\mathbb{R}^n} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) \qquad (9.3.6)$$

$$\epsilon + \mu\{x;\ |f_j(x) - f(x)| > \epsilon\} \ge \int_{\mathbb{R}^n} |f_j(x) - f(x)| \wedge 1 \, d\mu(x). \qquad (9.3.7)$$

Now, if (c) holds, inequality (9.3.6) implies (b). If (b) holds, then inequality (9.3.7) provides

$$\epsilon \ge \lim_{j \to \infty} \int_{\mathbb{R}^n} |f_j(x) - f(x)| \wedge 1 \, d\mu(x), \qquad \forall \epsilon > 0,$$

which means that the limit on the right is zero; hence (c) holds.
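The double inequality (9.3.5), which the proof reads off from Fig. 9.4, can also be verified numerically on a grid. The sketch below is only a sanity check, not a proof; the grid and the tested values of $\epsilon$ are arbitrary choices of ours.

```python
import numpy as np

def check_double_inequality(eps, xs):
    """Check eps * 1_{(eps, inf)}(x) <= min(x, 1) <= eps + 1_{(eps, inf)}(x) on the grid xs."""
    ind = (xs > eps).astype(float)      # indicator of the interval (eps, +inf)
    lower = eps * ind
    middle = np.minimum(xs, 1.0)        # x ^ 1
    upper = eps + ind
    return bool(np.all(lower <= middle) and np.all(middle <= upper))

xs = np.linspace(0.0, 5.0, 5001)        # grid of values x >= 0
results = [check_double_inequality(e, xs) for e in (0.01, 0.25, 0.5, 0.99)]
print(results)
```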


Next we shall make a few remarks.

1. Statement (a) represents the sequential convergence in the $d_\mu$-metric.

2. Condition (b) states the convergence in probability of $f_j$ to $f$. Assume one plays darts and $f_j(x)$ is the position shot in the $j$th trial given the initial condition (input) $x$. The set $\{x;\ |f_j(x) - f(x)| > \epsilon\}$ represents all inputs $x$ which produce shots outside the disk of radius $\epsilon$, centered at the target $f(x)$. Consequently, condition (b) states that the chances (probability measured using $\mu$) that in the long run the dart eventually hits inside any circular target increase to 1.

3. Part (c) can be written using expectation as $E[|f_j(x) - f(x)| \wedge 1] \to 0$. Since

$$E[|f_j(x) - f(x)| \wedge 1] \le E[|f_j(x) - f(x)|],$$

it follows that if $f_j \to f$ in $L^1$-sense, then (c) holds. Hence, the $L^1$-convergence implies convergence in probability. If we define

$$\rho_\mu(f, g) = E[|f - g| \wedge 1] = \int_{\mathbb{R}^n} |f(x) - g(x)| \wedge 1 \, d\mu(x),$$

then $\rho_\mu$ is a distance on $\mathcal{M}^n$. The previous result states that the topologies defined by the distance functions $d_\mu$ and $\rho_\mu$ are equivalent. Consequently, a set is $d_\mu$-dense in $\mathcal{M}^n$ if and only if it is $\rho_\mu$-dense.

The uniform convergence and the $d_\mu$-convergence are related by the following result:

Proposition 9.3.21 Let $(f_j)_j$ be a sequence of functions in $\mathcal{M}^n$ that converges uniformly on compacta to $f$, i.e., $\forall K \subset \mathbb{R}^n$ compact,

$$\sup_{x \in K} |f_j(x) - f(x)| \to 0, \qquad j \to \infty.$$

Then $d_\mu(f_j, f) \to 0$, as $j \to \infty$.


Proof: By Lemma 9.3.20, part (c), it suffices to show that $\forall \epsilon > 0$, $\exists N \ge 1$ such that

$$\int_{\mathbb{R}^n} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) < \epsilon, \qquad \forall j \ge N. \qquad (9.3.8)$$

This will be achieved by splitting the integral in two parts, each being smaller than $\epsilon/2$.

Step 1. We shall construct a compact set $K$ such that $\mu(\mathbb{R}^n \setminus K) < \dfrac{\epsilon}{2}$.










Denote by $B(0, k)$ the closed Euclidean ball of radius $k$, centered at the origin. Then $\mathbb{R}^n$ can be written as a limit of an ascending sequence of compact sets as

$$\mathbb{R}^n = \bigcup_{k \ge 1} B(0, k) = \limsup_{k} B(0, k).$$

From the sequential continuity of the probability measure $\mu$ we get

$$1 = \mu(\mathbb{R}^n) = \lim_{k \to \infty} \mu(B(0, k)).$$

Therefore, for $k$ large enough, $\mu(B(0, k)) > 1 - \epsilon/2$. Consider the compact $K = B(0, k)$, and note that $\mu(\mathbb{R}^n \setminus K) < \epsilon/2$.

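Step 2. Using Step 1 we have the estimation:

$$\int_{\mathbb{R}^n \setminus K} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) \le \int_{\mathbb{R}^n \setminus K} 1 \, d\mu(x) = \mu(\mathbb{R}^n \setminus K) < \frac{\epsilon}{2}.$$

Since $\sup_{x \in K} |f_j(x) - f(x)| \to 0$, we can find $N \ge 1$ such that

$$\sup_{x \in K} |f_j(x) - f(x)| < \frac{\epsilon}{2}, \qquad \forall j \ge N.$$

Therefore

$$\int_{K} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) \le \int_{K} \sup_{x \in K} |f_j(x) - f(x)| \, d\mu(x) \le \frac{\epsilon}{2} \mu(K) \le \frac{\epsilon}{2}.$$

Combining the last two inequalities yields

$$\int_{\mathbb{R}^n} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) = \int_{K} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) + \int_{\mathbb{R}^n \setminus K} |f_j(x) - f(x)| \wedge 1 \, d\mu(x) < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon, \qquad \forall j \ge N,$$

which proves the desired relation (9.3.8). ■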
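Step 1 of the proof above can be made concrete for a specific input measure. As an illustration, the sketch below takes $\mu$ to be the standard normal distribution on $\mathbb{R}$ (an assumption of ours, the proof works for any probability measure) and searches for the smallest integer radius $k$ with $\mu([-k, k]) > 1 - \epsilon/2$; the function name `smallest_radius` is hypothetical.

```python
import math

def smallest_radius(eps):
    """Smallest integer k with mu([-k, k]) > 1 - eps/2, for mu the standard normal on R.

    For the standard normal distribution, mu([-k, k]) = erf(k / sqrt(2)).
    """
    k = 1
    while math.erf(k / math.sqrt(2.0)) <= 1.0 - eps / 2.0:
        k += 1
    return k

k_coarse = smallest_radius(0.1)    # need mass > 0.95
k_fine = smallest_radius(0.01)     # need mass > 0.995
print(k_coarse, k_fine)
```

The loop terminates for every $\epsilon > 0$ precisely because $\mu(B(0, k)) \to 1$, which is the sequential continuity used in the proof.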

The next result extends Theorem 9.3.6 and will be used in the proof of the main result shortly.

Theorem 9.3.22 Let $\sigma : \mathbb{R} \to \mathbb{R}$ be any arbitrary continuous sigmoidal function (in the sense of Definition 2.2.1). Then the finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \qquad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R},\ N = 1, 2, \dots$$

are uniformly dense on compacta in $C(\mathbb{R}^n)$. This means that for any given function $f \in C(\mathbb{R}^n)$, there is a sequence $(G_m)_m$ of functions of the previous type such that $G_m \to f$ uniformly on any compact set $K$ in $\mathbb{R}^n$, as $m \to \infty$.











Proof: We shall follow a few steps.

Step 1: Show that if $f \in C(\mathbb{R}^n)$ and $K \subset \mathbb{R}^n$ is a compact set, then there is a sequence of functions of the previous type that converges uniformly to $f$ on $K$.

We note that Theorem 9.3.6 works also in the slightly more general case when $C(I_n)$ is replaced by $C(K)$, with $K \subset \mathbb{R}^n$ a compact set, the proof being similar. This means that for any fixed compact set $K$, the finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \qquad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}$$

are dense in $C(K)$. Since for any $f \in C(\mathbb{R}^n)$ we have $f|_K \in C(K)$, then there is a sequence $(G_m)_m$ of functions of the previous type such that $G_m \to f$ uniformly on $K$.

Step 2: There is an ascending sequence of compact sets, $(K_j)_j$, such that $K_j \subset K_{j+1}$ and $K_j \nearrow \mathbb{R}^n$, $j \to \infty$.

It suffices to define the sequence as $K_j = \{x \in \mathbb{R}^n;\ \|x\| \le j\}$.

Step 3: Use a diagonalization procedure to find the desired sequence.

For each compact set $K_j$ defined by Step 2, we consider the sequence $(G_m^{(j)})_m$ defined by Step 1, so $G_m^{(j)} \to f$ on $K_j$, as $m \to \infty$. Applying this result on each compact $K_j$, we obtain the following table of sequences:

$$\begin{array}{cccccl}
G_1^{(1)} & G_2^{(1)} & G_3^{(1)} & G_4^{(1)} & \cdots & \to f \text{ on } K_1 \\
G_1^{(2)} & G_2^{(2)} & G_3^{(2)} & G_4^{(2)} & \cdots & \to f \text{ on } K_2 \\
G_1^{(3)} & G_2^{(3)} & G_3^{(3)} & G_4^{(3)} & \cdots & \to f \text{ on } K_3 \\
G_1^{(4)} & G_2^{(4)} & G_3^{(4)} & G_4^{(4)} & \cdots & \to f \text{ on } K_4
\end{array}$$

The sequence in each row tends to $f$ uniformly, on a given compact set. Since the compacts $(K_j)_j$ are nested, we have that in fact $G_m^{(p)} \to f$, uniformly on any $K_j$, with $j \le p$. Using a diagonalization procedure, we construct the sequence $G_k = G_k^{(k)}$, $k \ge 1$. We shall show next that for any $K$ compact we have $G_k \to f$, uniformly on $K$.

For this we select the smallest integer $j_0$ such that we have the inclusions $K \subset K_{j_0} \subset K_{j_0+1} \subset K_{j_0+2} \subset \cdots$. Using the previous property, the sequence $(G_k)_{k \ge j_0}$ converges to $f$, uniformly on $K_{j_0}$, and hence on $K$.

■ 

The following result states that any measurable function can be approximated by a continuous function in the probability sense.










Figure 9.5: a. The graph of $y = h(x)$. b. The graph of $y = |f(x) - h(x)|$.


Proposition 9.3.23 The set of continuous functions $C(\mathbb{R}^n)$ is $d_\mu$-dense in $\mathcal{M}^n$.

Proof: Let $f \in \mathcal{M}^n$ be a measurable function. Fix $\epsilon > 0$. We need to show that there is a continuous function $g \in C(\mathbb{R}^n)$ such that $d_\mu(f, g) < \epsilon$. This shall be achieved in a few steps:

Step 1. Show that there is an $N$ large enough such that

$$\mu\{x;\ |f(x)| > N\} < \epsilon/2.$$

For this we consider $A_n = \{x;\ |f(x)| > n\}$. Since $A_{n+1} \subset A_n$ and $A_n = f^{-1}(n, +\infty) \cup f^{-1}(-\infty, -n) \in \mathcal{B}(\mathbb{R}^n)$, then $(A_n)_n$ is a descending sequence of measurable sets with limit $A_n \searrow \bigcap_n A_n = \{x;\ |f(x)| = \infty\}$. From the sequential continuity and the fact that $f$ is finite a.e. we get $\mu(A_n) \to 0$, $n \to \infty$. Therefore, we can choose $N$ large enough such that $\mu\{|f| > N\} < \epsilon/2$.

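Step 2. Show that $\rho_\mu(f, h) < \dfrac{\epsilon}{2}$, where $h(x) = f(x) 1_{\{|f| \le N\}}(x)$.

This means

$$\int_{\mathbb{R}^n} |f(x) - h(x)| \wedge 1 \, d\mu(x) < \frac{\epsilon}{2}.$$

If $1_{\{|f| \le N\}}$ denotes the indicator function of the set $\{x;\ |f(x)| \le N\}$, then

$$h(x) = f(x) 1_{\{|f| \le N\}}(x) = \begin{cases} f(x), & \text{if } |f(x)| \le N \\ 0, & \text{if } |f(x)| > N, \end{cases}$$

see Fig. 9.5 a. The difference becomes

$$|f(x) - h(x)| = \begin{cases} 0, & \text{if } |f(x)| \le N \\ |f(x)|, & \text{if } |f(x)| > N, \end{cases}$$

see Fig. 9.5 b. Note that by Step 1, this is nonzero only on a set with arbitrarily small measure. Integrating, using that $N \ge 1$ and the inequality obtained at Step 1, we obtain

$$\int_{\mathbb{R}^n} |f - h| \wedge 1 \, d\mu = \int_{\{|f| \le N\}} |f - h| \wedge 1 \, d\mu + \int_{\{|f| > N\}} |f - h| \wedge 1 \, d\mu$$

$$= \int_{\{|f| \le N\}} 0 \wedge 1 \, d\mu + \int_{\{|f| > N\}} |f| \wedge 1 \, d\mu = \int_{\{|f| > N\}} 1 \, d\mu = \mu\{|f| > N\} < \frac{\epsilon}{2}.$$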

Step 3. There is a continuous function, $g \in C(\mathbb{R}^n)$, with $\rho_\mu(h, g) < \dfrac{\epsilon}{2}$.

The function $h(x) = f(x) 1_{\{|f| \le N\}}(x)$ is measurable and bounded, with $|h(x)| \le N$. It is known that $h(x)$ is the limit of a sequence of simple functions $(s_k)_k$, see Appendix. We may further assume that $|s_k(x)| \le N$, for $k$ large enough. Since $\mu(\mathbb{R}^n) = 1$ and $s_k(x) \to h(x)$, the Bounded Convergence Theorem (Theorem C.4.3 in Appendix) implies

$$\int_{\mathbb{R}^n} |h(x) - s_k(x)| \wedge 1 \, d\mu(x) \to 0, \qquad k \to \infty.$$

In $\epsilon$-language, there is a $k_0 \ge 1$ such that

$$\rho_\mu(h, s_k) < \frac{\epsilon}{4}, \qquad \forall k \ge k_0. \qquad (9.3.9)$$


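Write the simple function as $s_k(x) = \sum_{i=1}^{k} \alpha_i 1_{A_i}(x)$ and consider compacta $K_i \subset A_i$ such that $\mu(A_i \setminus K_i) < \dfrac{\epsilon}{4kN}$ and the continuous functions

$$g_i(x) = \begin{cases} 1, & \text{if } x \in K_i \\ 0, & \text{if } x \notin A_i. \end{cases}$$

Construct the continuous function $g(x) = \sum_{i=1}^{k} \alpha_i g_i(x) \in C(\mathbb{R}^n)$. We have the estimation

$$\int_{\mathbb{R}^n} |g(x) - s_k(x)| \wedge 1 \, d\mu(x) \le \sum_{i=1}^{k} |\alpha_i| \int_{\mathbb{R}^n} |g_i(x) - 1_{A_i}(x)| \wedge 1 \, d\mu(x)$$

$$\le \sum_{i=1}^{k} |\alpha_i| \int_{A_i \setminus K_i} 1 \, d\mu(x) = \sum_{i=1}^{k} |\alpha_i| \mu(A_i \setminus K_i) < \frac{\epsilon}{4kN} \sum_{i=1}^{k} |\alpha_i| \le \frac{\epsilon}{4},$$

or, equivalently,

$$\rho_\mu(g, s_k) < \frac{\epsilon}{4}, \qquad \forall k \ge k_0. \qquad (9.3.10)$$

The triangle inequality together with relations (9.3.9) and (9.3.10) now provides

$$\rho_\mu(h, g) \le \rho_\mu(h, s_k) + \rho_\mu(g, s_k) < \frac{\epsilon}{4} + \frac{\epsilon}{4} = \frac{\epsilon}{2}.$$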


Step 4. Finishing the proof.

Using the triangle inequality for the distance $\rho_\mu$ and the estimations proved in Step 2 and Step 3, we have

$$\rho_\mu(f, g) \le \rho_\mu(f, h) + \rho_\mu(h, g) < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$

It follows that $C(\mathbb{R}^n)$ is $\rho_\mu$-dense in $\mathcal{M}^n$. By Lemma 9.3.20, remark 3, $C(\mathbb{R}^n)$ is also $d_\mu$-dense in $\mathcal{M}^n$, which is the desired conclusion.


The following theorem is the main result of this section. It states that any measurable function on $\mathbb{R}^n$ can be learned to any desired accuracy by a one-hidden layer neural network (with a linear neuron in the output layer), regardless of the space dimension $n$ and input probability measure $\mu$.

Theorem 9.3.24 Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a continuous sigmoidal function (in the sense of Definition 2.2.1). Then for any integer $n$, and every probability measure $\mu$, the finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j), \qquad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R}$$

are $d_\mu$-dense in $\mathcal{M}^n$.

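Proof: Let $f \in \mathcal{M}^n$ and fix $\epsilon > 0$. Consider the set of functions

$$\mathcal{U} = \Big\{ G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j),\ w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R},\ N = 1, 2, \dots \Big\}.$$

From Proposition 9.3.23 there is a continuous function $g \in C(\mathbb{R}^n)$ such that $d_\mu(f, g) < \epsilon/2$. By Theorem 9.3.22 there is a sequence $(G_k) \subset \mathcal{U}$ such that $G_k$ converges uniformly on compacta to $g$. Applying Proposition 9.3.21 yields that $d_\mu(G_k, g) \to 0$, $k \to \infty$, so $d_\mu(G_{k_0}, g) < \epsilon/2$, for some large enough $k_0$. The triangle inequality now provides

$$d_\mu(f, G_{k_0}) \le d_\mu(f, g) + d_\mu(G_{k_0}, g) < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon,$$

which shows that $\mathcal{U}$ is $d_\mu$-dense in $\mathcal{M}^n$. ■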




Remark 9.3.25 1. The previous result can be stated by saying that any one-hidden layer neural network, with sigmoidal neurons in the hidden layer and a linear neuron in the output, can learn any measurable function.

2. With some tedious modifications it can be shown that the conclusion of Theorem 9.3.24 holds true even if the activation function is a squashing function, see Definition 2.3.1. Consequently, measurable functions can be learned by one-hidden layer neural networks of classical perceptrons.

3. The approximation of measurable functions will be used later when neural networks will learn random variables. In this case the measure $\mu$ is the distribution measure of the input random variable.


9.4 Error Bounds Estimation 


We have seen that given a target function $f(x)$ in one of the classical spaces encountered before, $C(I_n)$, $L^1(I_n)$ or $L^2(I_n)$, there is an output function $G(x)$ produced by a one-hidden layer neural network, which is situated in a proximity of $f(x)$ with respect to a certain metric. It is expected that the larger the number of hidden units, the better the approximation of the target function. One natural question now is: how does the number of the hidden units influence the approximation accuracy?


The answer is given by the following result of Barron [12] that we shall discuss in the following. Assume the target function $f$ has a Fourier representation of the form

$$f(x) = \int_{\mathbb{R}^n} e^{i \omega^T x} \hat{f}(\omega) \, d\omega, \qquad x \in \mathbb{R}^n,$$

with $\omega \hat{f}(\omega)$ integrable; this means the frequencies $\hat{f}(\omega)$ decrease to zero fast enough for $\|\omega\|_1$ large, where $\|\omega\|_1 = \sum_{j=1}^{n} |\omega_j|$. Denote

$$C_f = \int_{\mathbb{R}^n} \|\omega\|_1 |\hat{f}(\omega)| \, d\omega.$$

The constant $C_f$ measures the extent to which the function $f$ oscillates. Since

$$|\partial_{x_j} f(x)| = \Big| \int_{\mathbb{R}^n} i \omega_j e^{i \omega^T x} \hat{f}(\omega) \, d\omega \Big| \le \int_{\mathbb{R}^n} |\omega_j| |\hat{f}(\omega)| \, d\omega \le C_f,$$

then all functions $f$, with $C_f$ finite, are continuously differentiable.

The accuracy of an approximation $G_k(x)$ to the target function $f(x)$ is measured in terms of the norm on $L^2(\mu, I_n)$, where $\mu$ is a probability measure on the hypercube, which describes the input pattern:

$$\|f - G_k\|_2^2 = \int_{I_n} |f(x) - G_k(x)|^2 \, d\mu(x).$$

This is also regarded as the integrated square error approximation. The rate of approximation is stated in the next result:

















Theorem 9.4.1 Given an arbitrary sigmoidal activation function (in the sense of Definition 2.2.1), a target function $f$ with $C_f < \infty$, and a probability measure $\mu$ on $I_n$, then for any number of hidden units, $N \ge 1$, there is a one-hidden layer neural network with sigmoidal activation function in the hidden neurons and linear activation in the output neuron, having the output $G_N$ such that the integrated square error $\|f - G_N\|_2^2$ is bounded by a constant multiple of $C_f^2 / N$.

Roughly speaking, for each 100-fold increase in the number of hidden units, the approximation earns an extra digit of accuracy. From this point of view, the approximation error resembles the error obtained when applying the Monte Carlo method.
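The "extra digit per 100-fold increase" rule is just the arithmetic of an error that decays like $1/\sqrt{N}$. The following small sketch makes this explicit; it assumes only that the root of the integrated square error is bounded by $c/\sqrt{N}$ for some constant $c$, and both the function name and the value $c = 1$ are placeholders of ours, not the constant of the theorem.

```python
def rms_error_scale(n_hidden, c=1.0):
    """Hypothetical error bound of the form c / sqrt(N); c is a placeholder constant."""
    return c / n_hidden ** 0.5

# Going from 10 to 1000 hidden units is a 100-fold increase.
ratio = rms_error_scale(1000) / rms_error_scale(10)
print(ratio)   # the bound shrinks by a factor of 10, i.e., one extra decimal digit
```

The same $1/\sqrt{N}$ behavior governs the standard error of a Monte Carlo average over $N$ samples, which is the resemblance pointed out above.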


9.5 Learning $q$-integrable functions, $f \in L^q(\mathbb{R}^n)$

For any $q \ge 1$, we define the norm

$$\|f\|_q = \Big( \int_{\mathbb{R}^n} |f(x)|^q \, dx \Big)^{1/q}.$$

We recall that the space $L^q(\mathbb{R}^n)$ is the set of Lebesgue measurable functions $f$ on $\mathbb{R}^n$ for which $\|f\|_q < \infty$.

A function $f : \mathbb{R}^n \to \mathbb{R}$ is called compactly supported if there is a compact set $K \subset \mathbb{R}^n$ such that $f$ vanishes outside of $K$, i.e.,

$$f(x) = 0, \qquad \forall x \in \mathbb{R}^n \setminus K.$$

We shall denote the set of continuous, compactly supported functions on $\mathbb{R}^n$ by

$$C_0(\mathbb{R}^n) = \{f : \mathbb{R}^n \to \mathbb{R};\ f \text{ continuous, compactly supported}\}.$$

We first note the inclusion $C_0(\mathbb{R}^n) \subset L^q(\mathbb{R}^n)$. This follows from the following argument. Let $K$ be a compact set such that $\{x;\ f(x) \neq 0\} \subset K$. Since $f$ is continuous on $K$, then $f$ is bounded, namely, there is a constant $M > 0$ such that $|f(x)| \le M$, for all $x \in K$. The previous inclusion follows now from the estimation

$$\int_{\mathbb{R}^n} |f(x)|^q \, dx = \int_{K} |f(x)|^q \, dx \le M^q \, \mathrm{vol}(K) < \infty,$$

where we used that any compact set is bounded and hence has a finite volume.

In order to be able to approximate functions in $L^q(\mathbb{R}^n)$ by continuous piecewise linear functions we need two density results:














Figure 9.6: A piecewise continuous function $\ell(x)$ approaches uniformly the compact supported continuous function $g(x)$.


Proposition 9.5.1 The family of compactly supported continuous functions on $\mathbb{R}^n$ is dense in $L^q(\mathbb{R}^n)$.

This means that given $f \in L^q(\mathbb{R}^n)$, then for any $\epsilon > 0$, there is $g \in C_0(\mathbb{R}^n)$ such that

$$\|f - g\|_q < \epsilon.$$

Even if there is no obvious proof for this result, we shall include next a rough argument. For more details, the reader is referred to the book [105].

If, by contradiction, $C_0(\mathbb{R}^n)$ is not dense in $L^q(\mathbb{R}^n)$, then there is a nonzero continuous linear functional $F : L^q(\mathbb{R}^n) \to \mathbb{R}$ such that $F|_{C_0(\mathbb{R}^n)} = 0$. By the Riesz representation theorem, there is $g \in L^p(\mathbb{R}^n)$, with $1/p + 1/q = 1$, such that $F(f) = \int f g$, for all $f \in L^q(\mathbb{R}^n)$. Then $\int \phi g = 0$, for all $\phi \in C_0(\mathbb{R}^n)$, a fact that implies $g = 0$, a.e. Therefore, $F = 0$ on $L^q(\mathbb{R}^n)$, which is a contradiction.

The second density result is:


Proposition 9.5.2 The family of continuous, piecewise linear functions on $\mathbb{R}^n$ is dense in the set of compactly supported continuous functions on $\mathbb{R}^n$.

This means that given $g \in C_0(\mathbb{R}^n)$, then for any $\epsilon > 0$, there is a continuous piecewise linear function $\ell$ such that

$$\sup_{x \in K} |g(x) - \ell(x)| < \epsilon,$$

where $K$ is the support of $g$, i.e., the closure of the set $\{x;\ g(x) \neq 0\}$. The proof considers a partition of $K$ into a finite set of polyhedra and takes $\ell$ to be an affine function over each polyhedron. When the partition gets refined the function $\ell$ gets closer to $g$. By a Dini-type theorem, Theorem 7.1.1, the uniform convergence on $K$ follows. This can be easily seen in the simpler one-dimensional case in Fig. 9.6.
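The one-dimensional picture can be reproduced numerically. The following sketch interpolates a compactly supported continuous function by a continuous piecewise linear one on finer and finer uniform partitions and records the sup error. The particular function $g$, the partitions and the names `g` and `sup_error` are illustrative assumptions of ours.

```python
import numpy as np

def g(x):
    """An illustrative continuous function supported on [-1, 1]."""
    return np.where(np.abs(x) <= 1.0, (1.0 - x**2) ** 2, 0.0)

def sup_error(m, n_eval=2001):
    """Sup distance on [-1, 1] between g and its linear interpolant on m equal pieces."""
    knots = np.linspace(-1.0, 1.0, m + 1)
    x = np.linspace(-1.0, 1.0, n_eval)
    ell = np.interp(x, knots, g(knots))   # continuous, affine on each subinterval
    return float(np.max(np.abs(g(x) - ell)))

errs = [sup_error(m) for m in (4, 16, 64)]
print(errs)
```

Each refinement of the partition reduces the sup error, which is the uniform convergence stated in the proposition.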

The following result is taken from Arora et al. [10].

Theorem 9.5.3 Any function in $L^q(\mathbb{R}^n)$, $(1 \le q \le \infty)$, can be arbitrarily well approximated in the $\|\cdot\|_q$ norm by a ReLU feedforward neural network with at most $L = 2(\lceil \log_2 n \rceil + 2)$ layers.

This means that if $f \in L^q(\mathbb{R}^n)$, then for any $\epsilon > 0$ there is a ReLU-network, $\mathcal{N}$, with input-output function $f_{\mathcal{N}}$, such that $\|f - f_{\mathcal{N}}\|_q < \epsilon$.

Proof: Let $\epsilon > 0$ be fixed. By Proposition 9.5.1 there is a function $g \in C_0(\mathbb{R}^n)$ such that $\|f - g\|_q < \epsilon/2$. Let $K$ be the support of the function $g$. Then by Proposition 9.5.2 there is a continuous piecewise linear function $\ell$ such that

$$\sup_K |g - \ell| < \frac{\epsilon}{2\, \mathrm{vol}(K)^{1/q}}.$$

Then integrating, we obtain $\|g - \ell\|_q < \epsilon/2$. By Proposition 10.2.7 there is a ReLU-network, $\mathcal{N}_\epsilon$, that can represent exactly the piecewise linear function $\ell$, namely, $\ell = f_{\mathcal{N}_\epsilon}$. Applying the triangle inequality to the norm $\|\cdot\|_q$ we obtain

$$\|f - f_{\mathcal{N}_\epsilon}\|_q \le \|f - g\|_q + \|g - \ell\|_q + \underbrace{\|\ell - f_{\mathcal{N}_\epsilon}\|_q}_{=0} < \epsilon,$$

which ends the proof.
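The integration step used in the proof can be written out as a short worked estimate. This is only a sketch under two assumptions that the text does not state explicitly: $q$ is finite, and $\ell$ vanishes outside $K$, so the integral runs over $K$ alone. Here $\mathrm{vol}(K)$ stands for the Lebesgue measure of $K$, and the uniform bound is taken to be $\sup_K |g - \ell| < \epsilon / (2\, \mathrm{vol}(K)^{1/q})$.

```latex
\|g - \ell\|_q^q
  = \int_K |g(x) - \ell(x)|^q \, dx
  \le \mathrm{vol}(K) \, \Big( \sup_K |g - \ell| \Big)^q
  < \mathrm{vol}(K) \, \frac{\epsilon^q}{2^q \, \mathrm{vol}(K)}
  = \Big( \frac{\epsilon}{2} \Big)^q .
```

Taking $q$-th roots gives $\|g - \ell\|_q < \epsilon/2$, which is the bound needed in the triangle inequality.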


Remark 9.5.4 The results of this section also hold true if the space $L^q(\mathbb{R}^n)$ is replaced by the space $L^q(K)$, with $K$ a compact set in $\mathbb{R}^n$.


9.6 Learning Solutions of ODEs 

Neural networks can also learn solutions of ordinary differential equations, provided the equations are well posed, namely, if the following properties are satisfied:

(i) The solution exists; 

(ii) The solution is unique; 

(iii) The solution is smooth with respect to the initial data. 

The case of the first-order ODEs is covered by equations of the type 

$$y'(t) = f(t, y(t))$$

$$y(t_0) = y_0, \quad t \in [t_0, t_0 + \epsilon),$$





















for some $\epsilon > 0$. It is known that if $f(t, \cdot)$ is Lipschitz in the second variable, then the aforementioned ODE has a unique solution for $\epsilon$ small enough. For simplicity, we assume that $f : [t_0, t_0 + \epsilon) \times \mathbb{R} \to \mathbb{R}$, namely, $y$ is considered one-dimensional. Similar results hold if $y \in \mathbb{R}^m$, in which case the above equation becomes a system of ODEs.

We shall construct a feedforward neural network whose input is the continuous variable $t \in [t_0, t_0 + \epsilon)$ and whose output is a smooth function, $\phi_\theta(t)$, which approximates the solution $y(t)$ of the previous equation. The parameters of the network are denoted by $\theta$.

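Consider the cost function

$$C(\theta) = \|\phi_\theta' - f(\cdot, \phi_\theta)\|_2^2 + |\phi_\theta(t_0) - y_0|^2 = \int_{t_0}^{t_0 + \epsilon} \big[\phi_\theta'(t) - f(t, \phi_\theta(t))\big]^2 \, dt + \big(\phi_\theta(t_0) - y_0\big)^2, \qquad (9.6.11)$$

and let $\theta^* = \arg\min C(\theta)$. Then the network optimal output, $\phi_{\theta^*}$, starts with a value $\phi_{\theta^*}(t_0)$ close to $y_0$ and evolves close to the solution starting at $y_0$.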
Since the previous cost function is difficult to implement, we shall consider the equidistant division $t_0 < t_1 < \dots < t_n = t_0 + \epsilon$, with $\Delta t = \epsilon/n$, and associate the empirical cost function

$$C(\theta; \{t_k\}) = \sum_{k=0}^{n-1} \left[ \frac{\phi_\theta(t_{k+1}) - \phi_\theta(t_k)}{\Delta t} - f(t_k, \phi_\theta(t_k)) \right]^2 \Delta t + \big(\phi_\theta(t_0) - y_0\big)^2.$$


One possible approach is to consider a neural network with one hidden layer, $N$ hidden neurons, and linear activation in the output. Consequently, the output is given by

$$\phi_\theta(t) = \sum_{j=1}^{N} \alpha_j \sigma(w_j t + b_j), \qquad \theta = (w, b, \alpha) \in \mathbb{R}^N \times \mathbb{R}^N \times \mathbb{R}^N. \qquad (9.6.12)$$

The differentiability of $\phi_\theta(t)$ is assured by that of the logistic function $\sigma$. Now we have two choices:

(i) Substitute relation (9.6.12) into the empirical cost function $C(\theta; \{t_k\})$ and minimize with respect to $\theta$ to obtain $\theta^* = \arg\min C(\theta; \{t_k\})$. Then $\phi_{\theta^*}(t)$ is an approximation of the solution $y(t)$.

(ii) Compute the gradient

$$\nabla_\theta C(\theta) = 2 \int_{t_0}^{t_0 + \epsilon} \big(\phi_\theta' - f(t, \phi_\theta)\big) \big(\nabla_\theta \phi_\theta' - \nabla_y f(t, \phi_\theta)^T \nabla_\theta \phi_\theta\big) \, dt + 2 \big(\phi_\theta(t_0) - y_0\big) \nabla_\theta \phi_\theta(t_0)$$

and substitute relation (9.6.12), see Exercise 9.8.11. Then apply gradient descent to construct the approximation sequence

$$\theta_{j+1} = \theta_j - \lambda \nabla_\theta C(\theta_j),$$

for a certain learning rate $\lambda > 0$.
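A minimal NumPy sketch of this procedure is given next, under several assumptions that are not part of the text: the test equation $y' = -y$, $y(0) = 1$ on $[0, 1]$, a central-difference gradient standing in for the analytic gradient of Exercise 9.8.11, and a step-halving rule replacing the fixed learning rate $\lambda$. All function names (`phi`, `empirical_cost`, `numerical_gradient`, `train`) are hypothetical.

```python
import numpy as np

def phi(theta, t):
    """Network output (9.6.12): sum_j alpha_j * sigma(w_j * t + b_j)."""
    w, b, alpha = np.split(theta, 3)
    z = np.outer(t, w) + b                      # shape (len(t), N)
    return (1.0 / (1.0 + np.exp(-z))) @ alpha

def empirical_cost(theta, t, f, y0):
    """Squared finite-difference residual plus the initial-condition penalty."""
    dt = t[1] - t[0]
    p = phi(theta, t)
    residual = (p[1:] - p[:-1]) / dt - f(t[:-1], p[:-1])
    return np.sum(residual ** 2) * dt + (p[0] - y0) ** 2

def numerical_gradient(cost, theta, h=1e-6):
    """Central differences; an assumed stand-in for the analytic gradient."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (cost(theta + step) - cost(theta - step)) / (2 * h)
    return grad

def train(f, y0, t, n_hidden=5, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=3 * n_hidden)       # theta = (w, b, alpha), flattened
    cost = lambda th: empirical_cost(th, t, f, y0)
    history = [cost(theta)]
    for _ in range(n_steps):
        grad = numerical_gradient(cost, theta)
        lam = 1.0
        # Halve the learning rate until the step actually lowers the cost.
        while lam > 1e-12 and cost(theta - lam * grad) >= history[-1]:
            lam *= 0.5
        if lam > 1e-12:
            theta = theta - lam * grad
        history.append(cost(theta))
    return theta, history

# Assumed test problem: y' = -y, y(0) = 1 on [0, 1], with n = 20 subintervals.
t_grid = np.linspace(0.0, 1.0, 21)
theta_star, history = train(lambda t, y: -y, y0=1.0, t=t_grid)
```

The list `history` records the empirical cost after each step, and `phi(theta_star, t_grid)` can be compared with $e^{-t}$ on the grid. The step-halving rule only guarantees that the cost never increases; it says nothing about how close the final network is to the true solution.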

9.7 Summary 

The chapter answers the question of what kind of functions can be approximated using multiple layer networks and is based on the approximation theory results developed by Funahashi, Hornik, Stinchcombe, White, and Cybenko in the late 1980s.

The results proved in this chapter show that a one-hidden layer neural network with sigmoid neurons in the hidden layer and linear activation in the output layer can learn continuous functions, integrable functions, and square integrable functions, as well as measurable functions. What makes neural networks potential universal approximators is not the specific choice of the activation function, but rather the feedforward network architecture itself.

The price paid is that the number of neurons in the hidden layer does not have an a priori bound. However, there is a result stating that for each extra digit of accuracy gain in the target approximation, the number of hidden neurons should increase 100-fold.

The approximation results also hold if the activation function is a ReLU or a Softplus. However, the proof is not a straightforward modification of the proofs provided in this chapter. The trade-off between depth, width and approximation error is still an active subject of research. For instance, in 2017, Lu et al. [79] proved a universal approximation theorem for width-bounded deep neural networks with ReLU activation functions that can approximate any Lebesgue integrable function on n-dimensional input space. Shortly after, Hanin [52] improved the result using ReLU-networks to approximate any continuous convex function of n-dimensional input variables.

Neural networks can also be used to solve numerically first-order differential equations with initial conditions.

9.8 Exercises 

Exercise 9.8.1 It is known that the set of rational numbers, $\mathbb{Q}$, is dense in the set of real numbers, $\mathbb{R}$. Formulate the previous result in terms of machine learning terminology.




Exercise 9.8.2 A well-known approximation result is the Weierstrass Approximation Theorem: Let $f$ be a continuous real-valued function defined on the real interval $[a, b]$. Then there is a sequence of polynomials $P_n$ such that $\sup_{[a,b]} |f(x) - P_n(x)| \to 0$, as $n \to \infty$.

Formulate the previous result in terms of machine learning terminology.

Exercise 9.8.3 (separation of points) Let $x_0, x_1$ be two distinct, non-collinear vectors in the normed linear space $X$. Show that there is a bounded linear functional $L$ on $X$ such that $L(x_0) = 1$ and $L(x_1) = 0$.

Exercise 9.8.4 Let $x_0, x_1$ be two distinct, non-collinear vectors in the normed linear space $X$. Show that there is a bounded linear functional $L$ on $X$ such that $L(x_0) = L(x_1) = 1/2$, with $\|L\| \le\ ?$, where $\delta_i$ is the distance from $x_i$ to the line generated by the other vector.

Exercise 9.8.5 Given two finite numbers, $a$ and $b$, show that there is a unique finite signed Borel measure on $[a, b]$ such that

$$\int_a^b \sin t \, d\mu(t) = \int_a^b \sin t \, d\mu(t).$$

Exercise 9.8.6 Let $\mathcal{P}([0,1])$ be the space of polynomials on $[0,1]$. For any $P \in \mathcal{P}([0,1])$ define the functional

$$L(P) = a_0 + a_1 + \dots + a_n,$$

where $P(x) = a_0 + a_1 x + \dots + a_n x^n$, $a_i \in \mathbb{R}$.

(a) Show that $L$ is a linear, bounded functional on $\mathcal{P}([0,1])$.

(b) Prove that there is a finite, signed Baire measure, $\mu \in M([0,1])$, such that

$$\int_0^1 P(x) \, d\mu(x) = a_0 + a_1 + \dots + a_n, \qquad \forall P \in \mathcal{P}([0,1]).$$

Exercise 9.8.7 (a) Is the hyperbolic tangent function discriminatory in the $L^2$-sense? What about in the $L^1$-sense?

(b) Show that the function $\phi(x) = e^{-x^2}$ is discriminatory in the $L^1$-sense on $\mathbb{R}$.

Exercise 9.8.8 (a) Write a formula for the output of a two-hidden layer FNN having $N_1$ and $N_2$ neurons in the hidden layers, respectively.

(b) Show that the outputs of all possible two-hidden layer FNNs with the same input form a linear space of functions.






Exercise 9.8.9 An activation function $\sigma$ is called strong discriminatory for the measure $\mu$ if



f){x) d[i(x ) = 0 , 


V/ G C(I n ) 



(a) Show that if $\sigma$ is strong discriminatory then it is discriminatory in the sense of Definition 2.2.2.

(b) Assume the activation functions of a two-layer FNN are continuous and strong discriminatory with respect to any signed measure. Show that this FNN can learn any continuous function in $C(I_n)$.


Exercise 9.8.10 Show that the metric $d_\mu$ defined by (9.3.3) satisfies the triangle inequality:

$$d_\mu(f, g) \le d_\mu(f, h) + d_\mu(h, g), \qquad \forall f, g, h \in \mathcal{M}(\mathbb{R}^n).$$

Exercise 9.8.11 Find $\phi_\theta'(t)$, $\nabla_\theta \phi_\theta(t)$, and $\nabla_\theta \phi_\theta'(t)$ for the expression of $\phi_\theta(t)$ given by (9.6.12).






Chapter 10 

Exact Learning 


By exact learning we mean the ability of a network to reproduce exactly the desired target function. For exact learning the network weights do not need tuning; their values can be found exactly. Even if this is unlikely to occur in general, there are a few particular cases when this happens. These cases will be discussed in this chapter.


10.1 Learning Finite Support Functions 


We shall start with the simplest case of finite support functions. The next result states that a one-hidden layer neural network, having the activation function in the hidden neurons given by the Heaviside function, $H(x)$, and one linear neuron in the output layer, has the capability to represent exactly any function with finite support on $\mathbb{R}^r$. More precisely, we have:

Proposition 10.1.1 Let $g : \mathbb{R}^r \to \mathbb{R}$ be an arbitrary function and consider a set $S = \{x_1, \dots, x_n\}$ of distinct points in $\mathbb{R}^r$. Then there is a function $G(x) = \sum_{i=1}^{n} \alpha_i H(w_i^T x - \theta_i)$, with $w_i \in \mathbb{R}^r$, $\alpha_i, \theta_i \in \mathbb{R}$, such that

$$G(x_j) = g(x_j), \qquad j = 1, \dots, n.$$


Proof: The proof will be done in two steps.

Step 1: Assume $r = 1$. In this case, the points are real numbers, so we may assume the order $x_1 < x_2 < \dots < x_n$. Choose thresholds $\theta_j$ such that $\theta_1 < x_1 < \theta_2 < x_2 < \theta_3 < x_3 < \dots < x_{n-1} < \theta_n < x_n$. It is easy to see that

$$H(x_i - \theta_j) = \begin{cases} 1, & \text{if } \theta_j < x_i \ (\text{or } j \le i) \\ 0, & \text{otherwise.} \end{cases}$$


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_10 




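Consider the function $G(x) = \sum_{i=1}^{n} \alpha_i H(x - \theta_i)$ and impose conditions $G(x_j) = g(x_j)$, $j = 1, \dots, n$. This leads to the following triangular linear system satisfied by the $\alpha_j$'s:

$$\begin{aligned} \alpha_1 &= g(x_1) \\ \alpha_1 + \alpha_2 &= g(x_2) \\ &\ \,\vdots \\ \alpha_1 + \dots + \alpha_n &= g(x_n), \end{aligned}$$

with solution $\alpha_1 = g(x_1)$ and $\alpha_j = g(x_j) - g(x_{j-1})$, for $2 \le j \le n$.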

Step 2: Assume $r \ge 2$. The idea is to reduce this case to Step 1. Consider the finite number of hyperplanes, $H_{ij} = \{p;\ p(x_i - x_j) = 0\}$, where $p(x_i - x_j)$ denotes the inner product between vectors $p$ and $x_i - x_j$, and choose a vector, $w$, which does not belong to any of the hyperplanes

$$w \in \mathbb{R}^r \setminus \bigcup_{i \neq j} H_{ij}.$$

Then the inner products $w x_i$ are distinct real numbers, which can be ordered, for instance, as

$$w x_1 < w x_2 < \dots < w x_n.$$

Using Step 1 we can construct the weights $\{\alpha_j\}$, thresholds $\{\theta_j\}$ and consider the function

$$G(x) = \sum_{i=1}^{n} \alpha_i H(w x - \theta_i),$$

which via Step 1 satisfies $G(x_j) = g(x_j)$, for $1 \le j \le n$. ■

Remark 10.1.2 We shall make a few remarks.

(i) We have seen in section 5.2 that a single perceptron cannot implement the function XOR, which is a function satisfying $g(0, 0) = 0$, $g(0, 1) = 1$, $g(1, 0) = 1$, and $g(1, 1) = 0$. The previous proposition implies that a neural network with one hidden layer and 4 computing units in the hidden layer can implement exactly the XOR function. In fact, only 2 neurons in the hidden layer would suffice, see Exercise 10.11.1.

(ii) The number of neurons in the hidden layer is equal to the number of data points, $n$. Therefore, the more the data, the wider the network.

(iii) The weights and biasses are not unique. There are infinitely many good choices for the parameters such that the network learns the function $g$.

(iv) The proposition still holds true if the step function $H(\cdot)$ is replaced by a squashing function $\psi$ (see section 2.3) that achieves 0 and 1. The proof idea is still the same.




(v) With some more effort the proof can be modified to obtain the result valid for arbitrary squashing functions. In particular, the result holds true if $H(x)$ is replaced by the logistic function, $\sigma(x)$. We note here the following connection with the message of section 8.6. In this case the network “memorizes” the values of the arbitrary function $g$, without having the capacity of generalizing to any underlying continuous model. From this point of view, the network behaves like a look-up table, and hence it is not of much interest for machine learning.

Replacing $g(x_i)$ by $y_i$, the result can be reformulated as follows.

Corollary 10.1.3 Given $n$ data, $(x_i, y_i) \in \mathbb{R}^r \times \mathbb{R}$, $1 \le i \le n$, there is a function $G(x) = \sum_{i=1}^{n} \alpha_i H(w_i^T x - \theta_i)$, with $w_i \in \mathbb{R}^r$, $\alpha_i, \theta_i \in \mathbb{R}$, such that

$$G(x_j) = y_j, \qquad j = 1, \dots, n.$$

This means that the neural network stores the data in its parameters. There are $n$ weights $\alpha_i$, $n$ biasses $\theta_j$ and $n$ matrices $W_j$ of type $r \times r$, in total $2n + nr^2$ real number parameters, which can store the data, which is equivalent to a vector of length $n(r + 1)$.
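The two-step construction behind Proposition 10.1.1 can be sketched in NumPy as follows. The function names `fit_heaviside_net` and `G` are hypothetical, the direction $w$ is drawn at random (an assumption: a random direction misses the finitely many hyperplanes $H_{ij}$ with probability one), and the thresholds are taken as midpoints between consecutive projections, which is one of infinitely many valid choices.

```python
import numpy as np

def fit_heaviside_net(X, y, seed=0):
    """Return (w, theta, alpha) such that G(x_j) = y_j on the data."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)   # n points in R^r
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])        # Step 2: a generic direction
    proj = X @ w                           # the real numbers w.x_i
    order = np.argsort(proj)
    p, ys = proj[order], y[order]
    if np.any(np.diff(p) <= 0):
        raise ValueError("projections are not distinct; try another seed")
    # Step 1: thresholds interlace the ordered projections.
    theta = np.concatenate([[p[0] - 1.0], 0.5 * (p[:-1] + p[1:])])
    # Solution of the triangular system: alpha_1 = y_1, alpha_j = y_j - y_{j-1}.
    alpha = np.concatenate([[ys[0]], np.diff(ys)])
    return w, theta, alpha

def G(X, w, theta, alpha):
    """Evaluate sum_i alpha_i * H(w.x - theta_i) at each row of X."""
    X = np.asarray(X, dtype=float).reshape(-1, w.size)
    z = (X @ w)[:, None] - theta[None, :]
    return (z > 0).astype(float) @ alpha   # H(z) = 1 for z > 0, else 0
```

For instance, calling `fit_heaviside_net` on the four XOR points with labels `[0, 1, 1, 0]` produces a network with 4 hidden units that reproduces the labels, in line with Remark 10.1.2 (i).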

10.2 Learning with ReLU 

The ReLU function can be extended to vectors $x \in \mathbb{R}^n$ using an action on components as

$$\mathrm{ReLU}(x) = (\mathrm{ReLU}(x_1), \dots, \mathrm{ReLU}(x_n)),$$

where $x^T = (x_1, \dots, x_n)$.

We also recall that an affine function $A : \mathbb{R}^n \to \mathbb{R}^m$ is given by

$$A(x) = (A_1(x), \dots, A_m(x)),$$

where $A_j(x) = \sum_{k=1}^{n} w_{kj} x_k + b_j$ is a real-valued affine function with real coefficients $w_{kj}$ and $b_j$.

Definition 10.2.1 A ReLU-feedforward neural network with input $x \in \mathbb{R}^n$ and output $y \in \mathbb{R}^m$ is given by specifying a sequence of $L - 1$ natural numbers, $\ell_1, \dots, \ell_{L-1}$, representing the widths of the hidden layers, and a set of $L$ affine functions $A_1 : \mathbb{R}^n \to \mathbb{R}^{\ell_1}$, $A_i : \mathbb{R}^{\ell_{i-1}} \to \mathbb{R}^{\ell_i}$, for $2 \le i \le L - 1$, and $A_L : \mathbb{R}^{\ell_{L-1}} \to \mathbb{R}^m$. The output of the network is given by

$$f = A_L \circ \mathrm{ReLU} \circ A_{L-1} \circ \mathrm{ReLU} \circ \dots \circ A_2 \circ \mathrm{ReLU} \circ A_1,$$

where $\circ$ denotes function composition.

The number $L$ denotes the depth of the network. The width is defined by $w = \max\{\ell_1, \dots, \ell_{L-1}\}$. The size of the network is $s = \ell_1 + \dots + \ell_{L-1}$, i.e., the number of hidden neurons.




10.2.1 Representation of maxima 


In this section we show that the function $f(x_1, \dots, x_n) = \max\{x_1, \dots, x_n\}$ can be represented exactly by a neural network having ReLU activation functions in the hidden layers and a linear activation function in the output layer. We also show that the depth of the network increases logarithmically with the input size $n$.

We start with a few notations and some elementary properties. Let $x^+ = \max\{x, 0\}$ and $x^- = \min\{x, 0\}$ denote the positive and the negative part of the real number $x$, respectively. We note that we have decompositions $x = x^+ + x^-$ and $|x| = x^+ - x^-$. We also have $x^+ = -(-x)^-$. Using the formula


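$$\max\{x_1, x_2\} = \frac{1}{2}\big(x_1 + x_2 + |x_1 - x_2|\big)$$

and the fact that $\mathrm{ReLU}(x) = x^+$, we can write the maximum of two numbers as a linear combination of ReLU functions as in the following:

$$\begin{aligned}
\max\{x_1, x_2\} &= \frac{1}{2}\big(x_1 + x_2 + |x_1 - x_2|\big) \\
&= \frac{1}{2}\big((x_1 + x_2)^+ + (x_1 + x_2)^- + (x_1 - x_2)^+ - (x_1 - x_2)^-\big) \\
&= \frac{1}{2}\big((x_1 + x_2)^+ - (-x_1 - x_2)^+ + (x_1 - x_2)^+ + (-x_1 + x_2)^+\big) \\
&= \frac{1}{2}\mathrm{ReLU}(1 \cdot x_1 + 1 \cdot x_2) - \frac{1}{2}\mathrm{ReLU}(-1 \cdot x_1 - 1 \cdot x_2) \\
&\quad + \frac{1}{2}\mathrm{ReLU}(1 \cdot x_1 - 1 \cdot x_2) + \frac{1}{2}\mathrm{ReLU}(-1 \cdot x_1 + 1 \cdot x_2).
\end{aligned}$$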


Therefore

$$\max\{x_1, x_2\} = \sum_{j=1}^{4} \lambda_j \mathrm{ReLU}(w_{1j} x_1 + w_{2j} x_2), \qquad (10.2.1)$$

which means that $\max\{x_1, x_2\}$ can be represented exactly by a feedforward neural network with one hidden layer. The activation in the 4 hidden neurons is given by the ReLU function, while the output neuron is linear, see Fig. 10.1. The weights from the input to first layer are

$$w_{11} = 1, \quad w_{12} = -1, \quad w_{13} = 1, \quad w_{14} = -1$$

$$w_{21} = 1, \quad w_{22} = -1, \quad w_{23} = -1, \quad w_{24} = 1,$$

and the weights from the first-to-second layer are given by

$$\lambda_1 = \lambda_3 = \lambda_4 = \frac{1}{2}, \qquad \lambda_2 = -\frac{1}{2}.$$

All biasses are zero. In the following $L$ denotes the depth of a network, which has $L - 1$ hidden layers.
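Formula (10.2.1) can be checked numerically with a few lines of NumPy. The sketch below hard-codes the weights listed above (columns of `W` index the four hidden neurons); the names `W`, `lam` and `max2` are illustrative choices, not notation from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# W[i, j] = w_{(i+1)(j+1)}: weight from input x_{i+1} to hidden neuron j+1.
W = np.array([[1.0, -1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0,  1.0]])
lam = np.array([0.5, -0.5, 0.5, 0.5])      # hidden-to-output weights

def max2(x1, x2):
    """One hidden layer, 4 ReLU neurons, linear output, zero biases."""
    hidden = relu(np.array([x1, x2]) @ W)
    return float(hidden @ lam)
```

Evaluating `max2(3.0, -7.0)` gives the hidden activations $(0, 4, 10, 0)$ and the output $-2 + 5 = 3$, which agrees with $\max\{3, -7\}$.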









Figure 10.1: A one-hidden layer network learning exactly the maximum of 
two numbers. 


Proposition 10.2.2 The function $f(x_1, \dots, x_n) = \max\{x_1, \dots, x_n\}$ can be represented exactly by a ReLU-feedforward neural network (as in Definition 10.2.1) having depth

$$L = \begin{cases} 2k, & \text{if } n = 2^k \\ 2(\lfloor \log_2 n \rfloor + 1), & \text{if } n \neq 2^k. \end{cases}$$


Proof: The proof will be done in a few steps.

Step 1: We prove the statement in the case $n = 2^k$, with $k \ge 1$. This will be done using induction over $k$.

The case $k = 1$ follows from the previous discussion, which shows that $\max\{x_1, x_2\}$ can be represented exactly by formula (10.2.1). In this case the network has depth $L = 2$.

In the case $k = 2$ we write

$$\max\{x_1, x_2, x_3, x_4\} = \max\{\max\{x_1, x_2\}, \max\{x_3, x_4\}\}.$$

Then $a = \max\{x_1, x_2\}$ and $b = \max\{x_3, x_4\}$ are represented by a neural network with one hidden layer each. We put these networks in parallel and add two more layers to compute $\max\{a, b\}$, see Fig. 10.2. The resulting network has $L = 2 + 2 = 4$, which verifies the given relation.

The induction hypothesis states that if $n = 2^k$ then

$$f(x_1, \dots, x_{2^k}) = \max\{x_1, \dots, x_{2^k}\}$$

can be represented by a network with $L_k = 2k$. Since

$$f(x_1, \dots, x_{2^{k+1}}) = \max\{ f(x_1, \dots, x_{2^k}),\ f(x_{2^k + 1}, \dots, x_{2^k + 2^k}) \},$$

then by the induction hypothesis $f(x_1, \dots, x_{2^k})$ and $f(x_{2^k + 1}, \dots, x_{2^k + 2^k})$ can be represented by a network each of depth $L_k$. We put these networks in parallel and add two more layers to compute $f(x_1, \dots, x_{2^{k+1}})$. The resulting network has $L_{k+1} = L_k + 2 = 2k + 2 = 2(k + 1)$, which ends the induction.

Figure 10.2: A neural network learning exactly the maximum of 4 numbers. The 8 neurons in the first layer can be considered as only 4 overlaying neurons.

Step 2: We prove the statement in the case $n \neq 2^k$, for $k \ge 1$.

In this case we have

$$2^k < n < 2^{k+1},$$

for $k = \lfloor \log_2 n \rfloor$. We complete the set $\{x_1, \dots, x_n\}$ to a set with cardinal equal to the next power of 2 by appending $2^{k+1} - n$ numbers equal to $x_n$

$$x_n = x_{n+1} = x_{n+2} = \dots = x_{2^{k+1}}.$$

Since

$$\max\{x_1, \dots, x_n\} = \max\{x_1, \dots, x_n, \dots, x_{2^{k+1}}\},$$

applying Step 1 we obtain a network of depth $L = 2(k + 1) = 2(\lfloor \log_2 n \rfloor + 1)$, which learns the maximum in the right side. This is the relation provided in the conclusion. ■
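The padding and pairing argument of the proof can be traced with a short Python sketch. It does not build the weight matrices; it only composes the two-input formula (10.2.1) in a tournament and counts two layers per round, as in the induction. The names `max2` and `relu_max` are hypothetical, and the input is assumed to have at least two entries.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def max2(a, b):
    """Formula (10.2.1): the maximum of two numbers from four ReLU units."""
    return 0.5 * relu(a + b) - 0.5 * relu(-a - b) \
         + 0.5 * relu(a - b) + 0.5 * relu(-a + b)

def relu_max(x):
    """Return (max of x, depth L counted as in Proposition 10.2.2)."""
    vals = [float(v) for v in x]
    size = 1 << (len(vals) - 1).bit_length()    # next power of 2 >= n
    vals += [vals[-1]] * (size - len(vals))     # Step 2: pad with copies of x_n
    depth = 0
    while len(vals) > 1:
        # One tournament round: pairwise maxima computed in parallel.
        vals = [float(max2(vals[i], vals[i + 1])) for i in range(0, len(vals), 2)]
        depth += 2                              # each round adds two layers
    return vals[0], depth
```

For $n = 5$ the list is padded to 8 entries and three rounds are needed, so the reported depth is $6 = 2(\lfloor \log_2 5 \rfloor + 1)$; for $n = 4$ it is $4 = 2k$ with $k = 2$.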


Remark 10.2.3 It is worth noting that a similar result holds for representing the minimum of $n$ numbers. The proof is similar, using the formula

$$\min\{x_1, x_2\} = \frac{1}{2}\big(x_1 + x_2 - |x_1 - x_2|\big).$$

The following result shows that for representing $\max\{x_1, x_2\}$ the width of the hidden layer can be reduced to only 2 neurons if one of the input variables is nonnegative and a ReLU activation is applied to the output neuron.


Proposition 10.2.4 There is a feedforward neural network with input of dimension 2, one hidden layer of width 2, output of dimension 1 and having ReLU activation functions for all neurons, which represents exactly the function

$$f(x_1, x_2) = \max\{x_1, x_2\}, \qquad \forall x_1 \in \mathbb{R},\ x_2 \in \mathbb{R}_+.$$

Proof: We consider the neural network given by Fig. 10.3. All biasses are taken equal to zero and the weights are chosen to be

$$w_{11} = 1, \quad w_{12} = 0, \quad w_{21} = -1, \quad w_{22} = 1, \quad \lambda_1 = 1, \quad \lambda_2 = 1.$$

Using these weights we construct the affine functions $A_1 : \mathbb{R}^2 \to \mathbb{R}^2$ and $A_2 : \mathbb{R}^2 \to \mathbb{R}$

$$A_1(x_1, x_2) = (x_1 - x_2,\ x_2), \qquad A_2(u, v) = u + v.$$





Figure 10.3: A neural network learning exactly the maximum of 2 numbers.


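Since all neurons have a ReLU activation function, the input-output mapping of the network is computed as in the following:

$$\begin{aligned}
f(x_1, x_2) &= (\mathrm{ReLU} \circ A_2 \circ \mathrm{ReLU} \circ A_1)(x_1, x_2) \\
&= (\mathrm{ReLU} \circ A_2 \circ \mathrm{ReLU})(x_1 - x_2,\ x_2) \\
&= (\mathrm{ReLU} \circ A_2)\big(\mathrm{ReLU}(x_1 - x_2),\ x_2\big) \\
&= \mathrm{ReLU}\big(\mathrm{ReLU}(x_1 - x_2) + x_2\big) \\
&= \max\{\mathrm{ReLU}(x_1 - x_2) + x_2,\ 0\} \\
&= \max\{\max\{x_1 - x_2, 0\} + x_2,\ 0\} \\
&= \max\{\max\{x_1, x_2\},\ 0\} = \max\{x_1, x_2\}, \qquad \forall x_1 \in \mathbb{R},\ x_2 \in \mathbb{R}_+.
\end{aligned}$$

The third line uses $\mathrm{ReLU}(x_2) = x_2$, which holds because $x_2 \ge 0$. The last identity holds since $\max\{x_1, x_2\} \ge x_2 \ge 0$.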
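The computation above translates directly into code. The following sketch, with the hypothetical name `narrow_max`, applies $A_1$, ReLU, $A_2$, ReLU in sequence and rejects inputs with $x_2 < 0$, since the identity is only claimed for nonnegative $x_2$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

A1 = np.array([[1.0, -1.0],    # first hidden neuron:  x1 - x2
               [0.0,  1.0]])   # second hidden neuron: x2
A2 = np.array([1.0, 1.0])      # output neuron: u + v

def narrow_max(x1, x2):
    """Width-2 ReLU network computing max{x1, x2}, valid for x2 >= 0."""
    if x2 < 0:
        raise ValueError("the representation requires x2 >= 0")
    hidden = relu(A1 @ np.array([x1, x2]))
    return float(relu(A2 @ hidden))
```

With $x_2 < 0$ the hidden ReLU would clip the second coordinate to zero and the output would no longer equal the maximum in general, which is why the hypothesis $x_2 \in \mathbb{R}_+$ cannot be dropped in this construction.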


10.2.2 Width versus Depth 

The next result deals with a trade-off between “shallow and wide” and “deep and narrow” for ReLU-nets, where by a ReLU-net we understand a feedforward neural network having ReLU activation functions for all its neurons (also including the output layer). More specifically, we have seen how a continuous function $f : [0, 1]^d \to \mathbb{R}$ can be learned by a one-hidden layer neural network with enough neurons in the hidden layer. We show here that in the case of a ReLU-neural network the width of the hidden layer can be reduced at the expense of the depth, namely, the same output is obtained by adding extra hidden layers of a certain bounded width, see Fig. 10.4. The next result is taken from Hanin [52]:

Proposition 10.2.5 Let $\mathcal{N}$ denote a ReLU-network (as in Definition 10.2.1) with input dimension $d$, a single hidden layer of width $n$, and output dimension 1. Then there exists another ReLU-net, $\widetilde{\mathcal{N}}$, that computes the same function as $\mathcal{N}$ on $[0, 1]^d$, having the same input dimension $d$, but having $n + 2$ hidden layers, each of width $d + 2$.

Figure 10.4: Two ReLU-networks, one “shallow and wide” and the other, “deep and narrow” can compute the same function.

We note a decrease in the width from $n$ to only $d + 2$, while the number of hidden layers increased from 1 to $n + 2$ (in general $n$ is much larger than $d$).
Proof: Consider the affine functions computed by the hidden neurons

$$A_j(x) = \sum_{k=1}^{d} w_{kj} x_k + b_j, \qquad 1 \le j \le n, \quad x \in [0, 1]^d.$$

Then the activation of the $j$th neuron in the hidden layer is given by $y_j = (\mathrm{ReLU} \circ A_j)(x)$, $1 \le j \le n$. It follows that the input-output function of the network $\mathcal{N}$ is

$$f_{\mathcal{N}}(x) = b + \sum_{j=1}^{n} \lambda_j \mathrm{ReLU}(A_j(x)),$$

where $b$ is the bias of the output neuron and $\lambda_j$ are the hidden-to-output weights.

Next, we shall construct the network $\widetilde{\mathcal{N}}$, such that $f_{\widetilde{\mathcal{N}}}(x) = f_{\mathcal{N}}(x)$, for all $x \in [0, 1]^d$. In order to specify $\widetilde{\mathcal{N}}$, it suffices to provide its weights and biasses, or equivalently, to specify the affine transformations $\widetilde{A}_j$ computed by the $j$th hidden layer of $\widetilde{\mathcal{N}}$.































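Since the function $x \to \sum_{j=1}^{n} \lambda_j \mathrm{ReLU}(A_j(x))$ is continuous on the compact set $[0, 1]^d$, then it is bounded, and hence there is a constant $T > 0$ such that

$$T + \sum_{j=1}^{n} \lambda_j \mathrm{ReLU}(A_j(x)) > 0, \qquad \forall x \in [0, 1]^d.$$

Enlarging $T$ if necessary, we may assume that the same inequality holds for every partial sum $\sum_{j=1}^{p} \lambda_j \mathrm{ReLU}(A_j(x))$, $1 \le p \le n$, which is what the layer-by-layer computation below uses.

The affine transformations $\widetilde{A}_1 : [0, 1]^d \to \mathbb{R}^{d+2}$, $\widetilde{A}_j : \mathbb{R}^{d+2} \to \mathbb{R}^{d+2}$, $2 \le j \le n + 1$, and $\widetilde{A}_{n+2} : \mathbb{R}^{d+2} \to \mathbb{R}$ are defined by

$$\begin{aligned}
\widetilde{A}_1(x) &= (x,\ A_1(x),\ T) \\
\widetilde{A}_j(x, y, z) &= (x,\ A_j(x),\ z + \lambda_{j-1} y), \qquad 2 \le j \le n + 1 \\
\widetilde{A}_{n+2}(x, y, z) &= z - T + b.
\end{aligned}$$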

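We shall track the first few layer activations. The first layer activation is

$$X^{(1)} = \mathrm{ReLU} \circ \widetilde{A}_1(x) = \mathrm{ReLU}(x, A_1(x), T) = (x,\ \mathrm{ReLU}(A_1(x)),\ T).$$

The next two are

$$\begin{aligned}
X^{(2)} &= \mathrm{ReLU} \circ \widetilde{A}_2 \circ \mathrm{ReLU} \circ \widetilde{A}_1(x) = \mathrm{ReLU} \circ \widetilde{A}_2\big(x, \mathrm{ReLU}(A_1(x)), T\big) \\
&= \big(x,\ \mathrm{ReLU}(A_2(x)),\ T + \lambda_1 \mathrm{ReLU}(A_1(x))\big) \\
X^{(3)} &= \mathrm{ReLU} \circ \widetilde{A}_3 \circ \mathrm{ReLU} \circ \widetilde{A}_2 \circ \mathrm{ReLU} \circ \widetilde{A}_1(x) \\
&= \mathrm{ReLU} \circ \widetilde{A}_3\big(x, \mathrm{ReLU}(A_2(x)), T + \lambda_1 \mathrm{ReLU}(A_1(x))\big) \\
&= \mathrm{ReLU}\big(x, A_3(x), T + \lambda_1 \mathrm{ReLU}(A_1(x)) + \lambda_2 \mathrm{ReLU}(A_2(x))\big) \\
&= \big(x,\ \mathrm{ReLU}(A_3(x)),\ T + \lambda_1 \mathrm{ReLU}(A_1(x)) + \lambda_2 \mathrm{ReLU}(A_2(x))\big).
\end{aligned}$$

Inductively, we obtain

$$X^{(p)} = \Big(x,\ \mathrm{ReLU}(A_p(x)),\ T + \sum_{j=1}^{p-1} \lambda_j \mathrm{ReLU}(A_j(x))\Big), \qquad \forall p \le n + 1.$$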

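The last layer activation of the network $\widetilde{\mathcal{N}}$ is given by

$$\begin{aligned}
X^{(n+2)} &= \widetilde{A}_{n+2} \circ X^{(n+1)} \\
&= \widetilde{A}_{n+2}\Big(x,\ \mathrm{ReLU}(A_{n+1}(x)),\ T + \sum_{j=1}^{n} \lambda_j \mathrm{ReLU}(A_j(x))\Big) \\
&= b + \sum_{j=1}^{n} \lambda_j \mathrm{ReLU}(A_j(x)) = f_{\mathcal{N}}(x), \qquad x \in [0, 1]^d.
\end{aligned}$$

Hence, $f_{\widetilde{\mathcal{N}}}(x) = f_{\mathcal{N}}(x)$, for all $x \in [0, 1]^d$, which ends the proof. ■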
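The construction can be checked numerically with the NumPy sketch below. Three choices in it are assumptions rather than part of the proof: the function names, the explicit value of $T$ (a crude bound that keeps $T$ plus every partial sum positive on $[0,1]^d$), and setting the undefined map $A_{n+1}$ to zero, which is harmless because its value is discarded by the last layer. Here row `W[j]` and entry `c[j]` hold the weights and bias of $A_{j+1}$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_net(x, W, c, lam, b):
    """f_N(x) = b + sum_j lam_j * ReLU(A_j(x)), A_j(x) = W[j] @ x + c[j]."""
    return b + lam @ relu(W @ x + c)

def safe_offset(W, c, lam):
    """A constant T with T + (any partial sum) >= 1 on [0, 1]^d."""
    return 1.0 + np.sum(np.abs(lam) * (np.abs(W).sum(axis=1) + np.abs(c)))

def deep_narrow_net(x, W, c, lam, b):
    """Same function, computed by layers of width d + 2."""
    n, d = W.shape
    T = safe_offset(W, c, lam)
    # Layer 1: (x, ReLU(A_1(x)), T); ReLU leaves x in [0, 1]^d unchanged.
    state = relu(np.concatenate([x, [W[0] @ x + c[0]], [T]]))
    for j in range(2, n + 2):                   # layers 2, ..., n + 1
        xs, y, z = state[:d], state[d], state[d + 1]
        a = W[j - 1] @ xs + c[j - 1] if j <= n else 0.0   # A_{n+1} := 0
        state = relu(np.concatenate([xs, [a], [z + lam[j - 2] * y]]))
    return state[d + 1] - T + b                 # output map: z - T + b
```

The third coordinate acts as an accumulator: after layer $p$ it holds $T$ plus the first $p - 1$ terms of the sum, and the offset $T$ is what keeps the ReLU from clipping it.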


Lemma 10.2.6 The input-output function of a ReLU-neural network is a continuous piecewise linear function.





Proof: Let $f$ denote the input-output function of the ReLU-network with $L$ layers. Then

$$f(x) = (A_L \circ \mathrm{ReLU} \circ \dots \circ \mathrm{ReLU} \circ A_2 \circ \mathrm{ReLU} \circ A_1)(x).$$

Since $f$ is a composition of continuous functions (both ReLU and the affine functions $A_j$ are continuous), it follows that $f$ is continuous. Because ReLU is piecewise defined, the aforementioned composition is also piecewise defined. Since, for $a > 0$,

$$\mathrm{ReLU}(ax - b) = \begin{cases} ax - b, & \text{if } x \ge b/a \\ 0, & \text{if } x < b/a, \end{cases}$$

the composition of ReLU and any of the affine functions $A_j$ is also piecewise linear. The repetitive composition of these types of functions is also piecewise linear.


The converse statement also holds true, see Arora et al. [10]. In the following we shall present a slightly different result where the network is allowed to have both ReLU and linear neurons.

Proposition 10.2.7 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a given continuous piecewise linear function. Then there is a feedforward neural network, having both ReLU and linear neurons, which can represent exactly the function $f$. The network has $L = 2(\lfloor \log_2 n \rfloor + 2)$ layers.

Proof: The proof is based on a result of Wang and Sung [122], which states that any continuous piecewise linear function $f$ can be represented as a linear combination of piecewise linear convex functions as

$$f = \sum_{j=1}^{p} s_j \max_{i \in S_j} \ell_i,$$

where $\{\ell_1, \dots, \ell_k\}$ is a finite set of affine linear functions and $S_j \subset \{1, 2, \dots, k\}$, each set $S_j$ having at most $n + 1$ elements and $s_j \in \{-1, 1\}$. The function $g_j = \max_{i \in S_j} \ell_i$ is a piecewise linear convex function having at most $n + 1$ pieces, since $|S_j| \le n + 1$. Each affine function $\ell_i$ is computed from the input $x$ by a linear neuron (first hidden layer). Each value $g_j(x)$ is computed from inputs $\ell_i(x)$ by a ReLU-neural network given as in Proposition 10.2.2 (these are $2(\lfloor \log_2 n \rfloor + 1)$ hidden layers). An output linear neuron can compute the function $f = \sum_{j=1}^{p} s_j g_j$ (the output layer). The total number of layers is $L = 1 + 2(\lfloor \log_2 n \rfloor + 1) + 1 = 2(\lfloor \log_2 n \rfloor + 2)$. ■


Remark 10.2.8 The previous result does not provide information about the width of the layers. Hanin [52] showed that in the case of a nonnegative continuous piecewise linear function, $f : \mathbb{R}^n \to \mathbb{R}_+$, the hidden layers have width $n + 3$ and the ReLU-network has depth $2p$.








Figure 10.5: Two-hidden layer exact representation of a continuous function $f$ in the case $n = 1$.

10.3 Kolmogorov-Arnold-Sprecher Theorem 

A version of the famous Hilbert's thirteenth problem is that there are analytic functions of three variables which cannot be written as a finite superposition of continuous functions in only two arguments. The conjecture was proved false by Arnold and Kolmogorov in 1957, who proved the following representation theorem, see [9] and [64].

Theorem 10.3.1 (Kolmogorov) Any continuous function $f(x_1, \dots, x_n)$ defined on $I_n$, $n \ge 2$, can be written in the form

$$f(x_1, \dots, x_n) = \sum_{j=1}^{2n+1} \chi_j\Big(\sum_{i=1}^{n} \psi_{ij}(x_i)\Big),$$

where $\chi_j$, $\psi_{ij}$ are continuous functions of one variable and $\psi_{ij}$ are monotone functions which are not dependent on $f$.

We note that for $n = 2$ the previous expression becomes

$$f(x_1, x_2) = \chi_1\big(\psi_{11}(x_1) + \psi_{21}(x_2)\big) + \chi_2\big(\psi_{12}(x_1) + \psi_{22}(x_2)\big) + \dots + \chi_5\big(\psi_{15}(x_1) + \psi_{25}(x_2)\big).$$

Sprecher refined Kolmogorov's result to the following version, where the outer functions $\chi_j$ are replaced by a single function $\chi$ and the inner functions $\psi_{ij}$ are of a special type and shifted by a translation:














Theorem 10.3.2 (Sprecher, 1964) For each integer $n \ge 2$, there exists a real, monotone increasing function $\psi(x)$, with $\psi([0, 1]) = [0, 1]$, dependent on $n$ and having the following property:

For each preassigned number $\delta > 0$, there is a rational number $\epsilon$, with $0 < \epsilon < \delta$, such that every real continuous function $f(x_1, \dots, x_n)$, defined on $I_n$, can be represented as

$$f(x_1, \dots, x_n) = \sum_{j=1}^{2n+1} \chi\Big(\sum_{i=1}^{n} \lambda^i \psi\big(x_i + \epsilon(j - 1)\big) + j - 1\Big),$$

where the function $\chi$ is real and continuous and $\lambda$ is a constant independent of $f$.

The proof can be found in Sprecher's paper [115]. This result is important to the field of neural networks because it states that any continuous function on $I_n$ can be represented exactly by a neural network with two hidden layers. The activation function for the first hidden layer is $\psi$ and for the second hidden layer is $\chi$.

For clarity reasons we have presented the neural network for the case $n = 1$ in Fig. 10.5. In this case each hidden layer has three computing units. The weights from the first-to-second hidden layer are all equal to $\lambda$. All the other weights (input to first hidden layer and second hidden layer to output layer) are equal to 1. Note that the threshold $\epsilon$ can be made arbitrarily small.

The case $n = 2$ is represented in Fig. 10.6, where for the sake of simplicity thresholds have been neglected and all edges without a weight have weight equal to 1 by default. The first hidden layer has 10 computing units while the second hidden layer has just 5 units.

Kolmogorov-Arnold-Sprecher's Theorem represents an interesting theoretical result regarding four-layer neural networks. However, it does not have much practical applicability due to the fact that the activation functions $\psi$ and $\chi$ are not set a priori, and their construction is laborious. In fact, it is worth noting the following trade-off property: in Kolmogorov's Theorem the number of intermediary units is given, but there is no control on the expression of $\psi$ and $\chi$. However, in some of the previous results the activation function was fixed a priori, while the number of hidden units was increased as needed until some desired level of approximation accuracy was reached.

10.4 Irie and Miyake’s Integral Formula 

Another result of theoretical importance is the integral formula of Irie and Miyake, see [60], which states that arbitrary functions of finite energy (that is $L^2$-integrable) can be represented by a three-layer network with a continuum of computational units, see Fig. 10.7. More precisely, the result states the following:

Figure 10.6: Two-hidden layer exact representation of a continuous function $f$ in the case $n = 2$.

Theorem 10.4.1 (Irie-Miyake, 1988) Let $f(x_1, \dots, x_n) \in L^2(\mathbb{R}^n)$ and $\psi(x) \in L^1(\mathbb{R})$. Let $\Psi(\xi)$ and $F(w_1, \dots, w_n)$ be the Fourier transforms of $\psi(x)$ and $f(x_1, \dots, x_n)$, respectively. If $\Psi(1) \neq 0$, then

$$f(x_1, \dots, x_n) = \int_{\mathbb{R}^{n+1}} \psi\Big(\sum_{i=1}^{n} x_i w_i - w_0\Big) \frac{1}{(2\pi)^n \Psi(1)} F(w_1, \dots, w_n)\, e^{i w_0} \, dw,$$

where $dw = dw_0\, dw_1 \dots dw_n$.

Figure 10.7: A one-hidden layer neural network with a continuum infinite number of computational units.

In the above formula $w_0$ corresponds to the bias, $w_i$ to the connection weights, and $\psi$ to the activation function in the hidden layer. The connection weights between the hidden layer and the output layer depend on the Fourier transform of $f$ as

$$\lambda(w_0, w_1, \dots, w_n) = \frac{1}{(2\pi)^n \Psi(1)} F(w_1, \dots, w_n)\, e^{i w_0}.$$

The summation over a continuum infinite number of computational units is achieved by integrating with respect to the weights and thresholds. We have a few points to make.






1. Since the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}} \notin L^1(\mathbb{R})$, the conclusion of the theorem does not necessarily hold for neural networks with sigmoid neurons.

2. However, the Gaussian function $\psi(x) = e^{-x^2}$ satisfies both properties $\psi \in L^1(\mathbb{R})$ and $\Psi(1) \neq 0$, which makes the formula valid for neural networks with Gaussian activation function.

3. Another major drawback, which makes this formula not useful in practice, is that the function to be realized is given by an integral formula which assumes that the network has a continuum infinite number of units.

4. The idea of recovering a finite energy function of several variables from the values along hyperplanes and its associated frequency is similar to the principle of the computerized tomography.
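The two properties claimed for the Gaussian in point 2 can be verified by a direct computation. The sketch below assumes the Fourier transform convention $\Psi(\xi) = \int_{\mathbb{R}} \psi(x) e^{-i \xi x}\, dx$; the text does not fix a convention, and a different normalization would only change the constant factor, not the conclusion $\Psi(1) \neq 0$.

```latex
\int_{\mathbb{R}} |\psi(x)| \, dx = \int_{\mathbb{R}} e^{-x^2} \, dx = \sqrt{\pi} < \infty ,
\qquad
\Psi(\xi) = \int_{\mathbb{R}} e^{-x^2} e^{-i \xi x} \, dx = \sqrt{\pi}\, e^{-\xi^2/4} ,
\qquad
\Psi(1) = \sqrt{\pi}\, e^{-1/4} \approx 1.38 \neq 0 .
```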


10.5 Exact Learning Not Always Possible 


The message of this section is that there are continuous functions on $[0,1]$ which cannot be represented exactly as outputs of neural networks. For this purpose, we consider a one-hidden-layer neural network having the activation in the hidden layer given by the logistic or hyperbolic tangent function, and having linear activation in the output layer. If an exact representation existed, then $f$ could be written as a finite combination of analytic functions such as

$$f(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j x - b_j).$$

Since $\sigma$ is an analytic function of $x$ (as an algebraic combination of $e^x$), the right-side expression is analytic on $(0,1)$, while the continuous function $f$ can be chosen not analytic, for instance,

$$f(x) = \begin{cases} 0, & \text{if } 0 \le x < 0.5 \\ x - 0.5, & \text{if } 0.5 \le x \le 1. \end{cases}$$

A similar lack of exact representations can also be found for $L^1$ and $L^2$ functions.
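The reason the piecewise example works can be spelled out with the identity theorem for real-analytic functions, a standard fact that the text uses implicitly. The following is only a sketch; the symbol $G$ and the coefficient names are introduced here for illustration.

```latex
% G is a hypothetical exact network representation of f; it is analytic on (0,1).
G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j x - b_j).
% If G = f on [0,1], then G vanishes on a whole subinterval:
G(x) = f(x) = 0 \quad \text{for } 0 < x < 0.5
\;\Longrightarrow\; G \equiv 0 \text{ on } (0,1) \quad \text{(identity theorem)}.
% This contradicts the values of f on the right half of the interval:
G(0.75) = f(0.75) = 0.25 \neq 0.
```

Hence no finite combination of logistic or hyperbolic tangent units can coincide with this $f$ on all of $[0,1]$.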


10.6 Continuum Number of Neurons 

This section extends exact learning to a deep neural network with a continuum of neurons in each hidden layer. We assume the input belongs to a compact interval, $t \in [0,1]$. A one-hidden-layer neural network, with $N$ neurons in the hidden layer, has an output given by $y(t) = \sum_{j=1}^{N} v_j \sigma(w_j t + b_j)$, where $\sigma$ is a sigmoid activation function, $w_j$ are the weights between the input and the $j$th node, $v_j$ is the weight between the $j$th node and the output, and $b_j$ denotes the bias for the $j$th neuron in the hidden layer.

Figure 10.8: A three-layer neural net with a continuum of neurons in the hidden layer; the output is $y(t) = \int_0^1 K(t,s) h(s)\, ds$.

Now we replace the finite number of neurons in the hidden layer by a continuum of hidden neurons, parametrized by $s \in [0,1]$, see Fig. 10.8. Hence, the index $j$ is replaced by the parameter $s$. Consequently, $w_j$, $b_j$ become continuous functions $w(s)$, $b(s)$, respectively, while the weights $v_j$ are replaced by the measure $h(s)\, ds$. The summation is transformed into an integral, the network output being given by the integral transform

$$y(t) = \int_0^1 K(t,s) h(s)\, ds, \qquad (10.6.2)$$

where the system of biases and weights is encapsulated into the integral kernel $K(t,s) = \sigma(w(s)t + b(s))$.
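To make the passage from a finite sum to the integral transform concrete, the following minimal sketch approximates (10.6.2) by the midpoint rule, so that a network with finitely many hidden units appears as a quadrature of the continuum network. The particular functions `w`, `b`, `h` and the helper name `continuum_output` are hypothetical choices made only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def continuum_output(t, w, b, h, n_nodes=2000):
    """Midpoint-rule approximation of y(t) = int_0^1 sigma(w(s) t + b(s)) h(s) ds."""
    s = (np.arange(n_nodes) + 0.5) / n_nodes      # midpoints of n_nodes equal cells
    K = sigmoid(np.outer(t, w(s)) + b(s))         # K[i, j] = sigma(w(s_j) t_i + b(s_j))
    return K @ h(s) / n_nodes                     # hidden unit j carries weight h(s_j) / n_nodes

# Hypothetical weight, bias and output-weight functions (assumptions, not from the text)
w = lambda s: 1.0 + 2.0 * s
b = lambda s: -s
h = lambda s: np.cos(np.pi * s)

t = np.linspace(0.0, 1.0, 5)
y_coarse = continuum_output(t, w, b, h, n_nodes=50)     # a 50-neuron network
y_fine = continuum_output(t, w, b, h, n_nodes=5000)     # a much wider network
print(np.max(np.abs(y_coarse - y_fine)))
```

In this reading, each quadrature node plays the role of one hidden neuron with output weight $v_j = h(s_j)/N$.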

This section investigates the following problem: Given a continuous function $g \in C[0,1]$, does the previous network learn exactly the function $g$? Namely, are there any continuous functions $w, b, h \in C[0,1]$ such that $g(t) = \int_0^1 K(t,s) h(s)\, ds$ for all $t \in [0,1]$?

The provided answer is partial, holding for a certain subclass of continuous functions, whose Fourier coefficients tend to zero fast enough. While approaching this problem, we shall employ techniques of integral operators.










The theory of integral equations is well developed in the case of symmetric integral kernels. To better understand these types of kernels, the next lemma deals with the symmetry of kernels of the particular type $K(t,s) = \sigma(w(s)t + b(s))$. However, the rest of the section will assume symmetric kernels of a more general form.

Lemma 10.6.1 Let $w, b \in C[0,1]$ and consider $K(t,s) = \sigma(w(s)t + b(s))$, $\forall s, t \in [0,1]$. The following conditions are equivalent:

(i) $K$ is symmetric: $K(s,t) = K(t,s)$, $\forall s, t \in [0,1]$;

(ii) $w(s)$ and $b(s)$ are affine, given by

$$w(s) = (w(1) - w(0))s + w(0)$$

$$b(s) = w(0)s + b(0),$$

with $w(0) = b(1) - b(0)$.

Proof: (i) $\Rightarrow$ (ii) Since $\sigma$ is one-to-one, the symmetry of $K$ is equivalent to

$$w(s)t + b(s) = w(t)s + b(t), \quad \forall s, t \in [0,1]. \qquad (10.6.3)$$

Taking the limit $s \to 0$ and using the continuity of $w$ and $b$ yields

$$b(t) = w(0)t + b(0),$$

which shows that $b(t)$ is affine. Taking the limit $t \to 1$ in (10.6.3), solving for $w(s)$ and using the previous relation for $b$, we have

$$w(s) = w(1)s + b(1) - b(s) = (w(1) - w(0))s + b(1) - b(0).$$

Then taking $s = 0$ provides $w(0) = b(1) - b(0)$.

(ii) $\Rightarrow$ (i) Using the expressions of $w(s)$ and $b(s)$, a computation shows that

$$w(s)t + b(s) = (w(1) - w(0))st + w(0)(t + s) + b(0) = w(t)s + b(t).$$

This implies that (10.6.3) holds, and hence $K(t,s) = K(s,t)$. ■

It is worth noting that the kernel in this case becomes

$$K(t,s) = \sigma(\alpha st + \beta(s + t) + \gamma),$$

with $\alpha = w(1) - w(0)$, $\beta = w(0)$, and $\gamma = b(0)$. Thus the symmetric kernel depends only on the sum and the product of its variables, $s$ and $t$.
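The lemma can be checked numerically on a grid. The sketch below builds the kernel for one affine pair satisfying condition (ii) and for one affine pair violating $w(0) = b(1) - b(0)$, and measures the deviation from symmetry. The numerical values of `alpha`, `beta`, `gamma` and the helper `kernel` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kernel(w, b, grid):
    """K[i, j] = sigma(w(s_j) t_i + b(s_j)) on a common grid for t and s."""
    return sigmoid(np.outer(grid, w(grid)) + b(grid))

grid = np.linspace(0.0, 1.0, 101)
alpha, beta, gamma = 2.0, -0.5, 0.3        # hypothetical values

# Pair satisfying condition (ii): w(s) = alpha*s + beta, b(s) = beta*s + gamma,
# so that w(0) = beta = b(1) - b(0).
K_sym = kernel(lambda s: alpha * s + beta, lambda s: beta * s + gamma, grid)

# Affine pair violating w(0) = b(1) - b(0): here w(0) = beta but b(1) - b(0) = 1.
K_asym = kernel(lambda s: alpha * s + beta, lambda s: s + gamma, grid)

sym_gap = np.max(np.abs(K_sym - K_sym.T))      # should vanish up to rounding
asym_gap = np.max(np.abs(K_asym - K_asym.T))   # should be clearly nonzero
print(sym_gap, asym_gap)
```

The second pair shows that affine $w$ and $b$ alone are not enough; the compatibility relation between the two functions is essential.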








So far, we have determined the functions $w(s)$ and $b(s)$. We still need to find the weight function $h(s)$. We shall determine it by solving the integral equation of the first kind

$$g(t) = \int_0^1 K(t,s) h(s)\, ds. \qquad (10.6.4)$$

Since the kernel $K$ is continuous, the integral operator on the right side improves the smoothness of the function $h$. Therefore, in general, the equation (10.6.4) with arbitrary continuous $g$ cannot be solved by a continuous function $h$. Thus, we need to assume more restrictive properties on the function $g$.


Before going any further, we recall a few properties of integral operators, which will be used later:

1. If there is a number $\lambda$ and a function $e(\cdot)$ such that

$$\int_0^1 K(t,s) e(s)\, ds = \lambda e(t),$$

then $\lambda$ is called an eigenvalue and $e(t)$ an eigenfunction corresponding to the eigenvalue $\lambda$.

2. If $K$ is symmetric, then all eigenvalues are real.

3. If $K$ is symmetric and nondegenerate, there is a countably infinite number of eigenvalues and eigenfunctions. (The kernel $K$ is called degenerate if $K(t,s) = \sum_{i=1}^{n} \alpha_i(t) \beta_i(s)$.)

4. The eigenfunctions corresponding to nonzero eigenvalues are continuous.

5. The eigenfunctions corresponding to distinct eigenvalues are orthogonal in the $L^2$-sense.

6. If $f \in L^2[0,1]$, then $\sum_{n \ge 1} f_n^2 \le \|f\|^2$, where $f_n = \int_0^1 f(t) e_n(t)\, dt$ and $\|f\|^2 = \int_0^1 f(t)^2\, dt$, where $e_n(t)$ denotes the $n$th eigenfunction. This is called the Bessel inequality.


We shall state without proof three properties involving uniform convergence, which can be found in any analysis textbook and will be used later.

A useful test for the uniform convergence of a series of functions is the following:

Proposition 10.6.2 (Cauchy) Let $f_n : [0,1] \to \mathbb{R}$ be a sequence of functions. The following are equivalent:

(i) The series $\sum_{n \ge 1} f_n$ is uniformly convergent on $[0,1]$;

(ii) $\forall \epsilon > 0$, there is an integer $N \ge 1$ such that

$$\Big| \sum_{j=n}^{n+p} f_j(t) \Big| < \epsilon, \quad \forall t \in [0,1],\ \forall n \ge N,\ \forall p \ge 1.$$






Condition (ii) states that the sequence of partial sums is uniformly Cauchy. The next result states that uniform convergence preserves continuity:

Proposition 10.6.3 (Weierstrass) If each function $f_n$ is continuous at $x_0 \in [0,1]$ and the series $\sum_{n \ge 1} f_n$ is uniformly convergent on $[0,1]$, then the series sum is continuous at $x_0$.

The next result provides a sufficient condition for uniform convergence:

Proposition 10.6.4 Let $f, f_n : [0,1] \to [0, +\infty)$ be continuous functions, such that $f(t) = \sum_{n \ge 1} f_n(t)$ for $t \in [0,1]$. Then the series $\sum_{n \ge 1} f_n$ is uniformly convergent.

The next result states the existence of the weight function $h$, given some extra conditions on the function $g$. We note that the symmetric kernel $K$ used in the next result is of general type, not of the particular type given in Lemma 10.6.1.


Lemma 10.6.5 Let $K : [0,1] \times [0,1] \to \mathbb{R}$ be a continuous and symmetric kernel. Consider $g \in C[0,1]$, satisfying the properties:

(i) It has the representation $g(t) = \sum_{n \ge 1} g_n e_n(t)$, with $g_n = \int_0^1 g(t) e_n(t)\, dt$, where $e_n$ is the $n$th eigenfunction of the kernel;

(ii) The series $\sum_{n \ge 1} \frac{g_n^2}{\lambda_n^2}$ is convergent, where $\lambda_n$ denotes the eigenvalue corresponding to the eigenfunction $e_n$.

Then there is a function $h \in C[0,1]$ such that

$$g(t) = \int_0^1 K(t,s) h(s)\, ds.$$

Proof: We shall break the proof into several steps.

Step 1. The series $\sum_{n \ge 1} \lambda_n^2 e_n^2(t)$ is absolutely convergent on $[0,1]$ and uniformly bounded in $t$.

Applying Bessel's inequality for the function $f(s) = K(t,s)$ and using that $e_n$ is an eigenfunction yields

$$\int_0^1 K^2(t,s)\, ds \ge \sum_{n \ge 1} \Big( \int_0^1 K(t,s) e_n(s)\, ds \Big)^2 = \sum_{n \ge 1} \lambda_n^2 e_n^2(t).$$

The continuity of $K$ implies $\int_0^1 K^2(t,s)\, ds \le M^2$, for all $t \in [0,1]$. Thus, $\sum_{n \ge 1} \lambda_n^2 e_n^2(t)$ is convergent for any $t \in [0,1]$ and also uniformly bounded by $M^2$.




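Step 2. Let $h_n = g_n / \lambda_n$. The series $\sum_{n \ge 1} h_n e_n(t)$ is uniformly convergent on $[0,1]$.

It suffices to show that condition (ii) of Proposition 10.6.2 holds. Let $\epsilon > 0$ be arbitrarily fixed. Using Cauchy's inequality we have

$$\Big( \sum_{j=n}^{n+p} h_j e_j(t) \Big)^2 = \Big( \sum_{j=n}^{n+p} \frac{g_j}{\lambda_j} \lambda_j e_j(t) \Big)^2 \le \sum_{j=n}^{n+p} \frac{g_j^2}{\lambda_j^2} \sum_{j=n}^{n+p} \lambda_j^2 e_j^2(t).$$

From Step 1 it follows that $\sum_{j=n}^{n+p} \lambda_j^2 e_j^2(t) \le M^2$. From hypothesis condition (ii) we have that the series $\sum_{n \ge 1} \frac{g_n^2}{\lambda_n^2}$ is convergent, so for $n$ large enough

$$\sum_{j=n}^{n+p} \frac{g_j^2}{\lambda_j^2} < \frac{\epsilon^2}{M^2}.$$

Substituting in the previous inequality we obtain

$$\Big| \sum_{j=n}^{n+p} h_j e_j(t) \Big| < \epsilon, \quad \forall t \in [0,1],$$

for $n$ large enough and any $p \ge 1$. The assertion follows now from Proposition 10.6.2.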

Step 3. Check that $h(t) = \sum_{n \ge 1} h_n e_n(t)$ is a solution.

This involves two things: $h \in C[0,1]$ and $h$ satisfies the integral equation.

By Step 2 and Proposition 10.6.3, using that $e_n(t)$ is continuous, it follows that $h(t) = \sum_{n \ge 1} h_n e_n(t)$ is continuous on $[0,1]$.

Since uniform convergence allows integration term by term, we have

$$\int_0^1 K(t,s) h(s)\, ds = \sum_{n \ge 1} h_n \int_0^1 K(t,s) e_n(s)\, ds = \sum_{n \ge 1} h_n \lambda_n e_n(t) = \sum_{n \ge 1} g_n e_n(t) = g(t).$$

This ends the proof.

Remark 10.6.6 1. The convergence condition $\sum_{n \ge 1} \frac{g_n^2}{\lambda_n^2} < \infty$ implies that the Fourier coefficients $g_n$ tend to zero faster than the eigenvalues $\lambda_n$ do. This restricts the class of continuous functions learned by the network.








2. The fact that $\lim_{n \to \infty} \lambda_n = 0$ can be shown by integrating one more time in the inequality provided at Step 1:

$$M^2 \ge \int_0^1 \int_0^1 K^2(t,s)\, ds\, dt \ge \sum_{n \ge 1} \lambda_n^2 \int_0^1 e_n(t)^2\, dt = \sum_{n \ge 1} \lambda_n^2.$$

Since the series $\sum_{n \ge 1} \lambda_n^2$ is convergent, it follows that $\lambda_n$ tends to zero.

3. There is a more general result of Picard, see [29] p.160, which states that a necessary and sufficient condition for the existence of a solution $h \in L^2[0,1]$ for the integral equation

$$g(t) = \int_0^1 K(t,s) h(s)\, ds$$

is the convergence of the series $\sum_{n \ge 1} \frac{g_n^2}{\lambda_n^2}$. It is worth noting that this result holds also for nonsymmetric kernels $K$.


The results of Lemmas 10.6.1 and 10.6.5 can be combined by considering a symmetric kernel of the type $K(t,s) = \sigma(\alpha st + \beta(s + t) + \gamma)$. This yields the existence of a weight density $h(s)$ such that the network learns a continuous function $g$ whose Fourier coefficients decrease to zero fast enough relative to the eigenvalues of the kernel, in the sense of condition (ii) of Lemma 10.6.5.
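The construction $h = \sum_n (g_n/\lambda_n) e_n$ from the proof of Lemma 10.6.5 can be illustrated on a grid. The sketch below is only a discretized stand-in: the integral operator is replaced by a midpoint-rule matrix, its eigenvectors play the role of the eigenfunctions, and the kernel parameters, the truncation level `m`, and the helper `solve_first_kind` are hypothetical choices. The target $g$ is manufactured from a known $h$ so that the recovery can be checked.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 400
t = (np.arange(n) + 0.5) / n                    # midpoint grid on [0, 1]
T, S = np.meshgrid(t, t, indexing="ij")
K = sigmoid(4.0 * S * T - 1.0 * (S + T))        # alpha=4, beta=-1, gamma=0 (hypothetical)

A = K / n                                       # midpoint-rule discretization of h -> int K h ds
lam, V = np.linalg.eigh(A)                      # real eigenvalues, since A is symmetric
order = np.argsort(-np.abs(lam))                # sort by decreasing magnitude
lam, V = lam[order], V[:, order]
E = np.sqrt(n) * V                              # columns approximate L^2[0,1]-orthonormal eigenfunctions

def solve_first_kind(g, m):
    """Truncated expansion h = sum_{k<m} (g_k / lam_k) e_k."""
    g_coef = E[:, :m].T @ g / n                 # g_k ~ int_0^1 g(t) e_k(t) dt
    return E[:, :m] @ (g_coef / lam[:m])

h_true = E[:, 0] - 0.5 * E[:, 2]                # a weight function with few nonzero coefficients
g = A @ h_true                                  # the function the network should learn
h_rec = solve_first_kind(g, m=3)
print(np.max(np.abs(h_rec - h_true)))
```

Because the eigenvalues decay quickly, dividing by $\lambda_k$ amplifies noise in the higher coefficients of $g$; this is the numerical face of the convergence condition (ii).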


10.7 Approximation by Degenerate Kernels 


Another way of dealing effectively with kernels is to approximate them with degenerate kernels, i.e., with kernels that can be represented as a finite sum of separable functions in the underlying variables, namely, as finite sums of the type $A_n(t,s) = \sum_{i=1}^{n} \alpha_i(t) \beta_i(s)$. The next result approximates uniformly the continuous kernel $K(t,s)$ by degenerate kernels with $\alpha_i$ and $\beta_i$ continuous.

Proposition 10.7.1 Let $K : [0,1] \times [0,1] \to \mathbb{R}$ be a continuous kernel. Then $\forall \epsilon > 0$, there is an $n \ge 1$ and $\alpha_i, \beta_i \in C[0,1]$, $i = 1, \dots, n$, such that

$$\Big| K(t,s) - \sum_{i=1}^{n} \alpha_i(t) \beta_i(s) \Big| < \epsilon, \quad \forall t, s \in [0,1].$$

Proof: This is a reformulation of Proposition 7.4.5 and a consequence of the Stone-Weierstrass Theorem. ■

In the following we address the question of how the network output is affected if the kernel $K$ is replaced by a nearby degenerate kernel.






Let $h \in L^2[0,1]$ be fixed and denote the outputs by

$$g(t) = \int_0^1 K(t,s) h(s)\, ds, \qquad g_n(t) = \int_0^1 A_n(t,s) h(s)\, ds,$$

with $A_n(t,s) = \sum_{i=1}^{n} \alpha_i(t) \beta_i(s)$ given by Proposition 10.7.1. Both functions $g$ and $g_n$ are continuous. In particular, $g_n$ is a linear combination of the continuous functions $\alpha_i$,

$$g_n(t) = \sum_{i=1}^{n} c_i \alpha_i(t),$$

with coefficients $c_i = \int_0^1 \beta_i(s) h(s)\, ds \le \|\beta_i\| \|h\|$.

The next result states that whatever the network can learn using arbitrary continuous kernels, it can also learn using degenerate kernels; hence degenerate kernels suffice.

Proposition 10.7.2 The sequence $g_n$ converges uniformly to $g$ on $[0,1]$ as $n \to \infty$.

Proof: Let $\epsilon > 0$ be arbitrarily fixed. By Proposition 10.7.1 there is an integer $n \ge 1$ such that

$$|K(t,s) - A_n(t,s)| < \frac{\epsilon}{\|h\|}, \quad \forall t, s \in [0,1].$$

Using Cauchy's inequality, we have

$$|g(t) - g_n(t)|^2 = \Big| \int_0^1 \big(K(t,s) - A_n(t,s)\big) h(s)\, ds \Big|^2 \le \int_0^1 |K(t,s) - A_n(t,s)|^2\, ds \int_0^1 h^2(s)\, ds < \frac{\epsilon^2}{\|h\|^2} \|h\|^2 = \epsilon^2, \quad \forall t \in [0,1].$$

Hence, for the previous $n$, we obtain

$$\max_{t \in [0,1]} |g(t) - g_n(t)| < \epsilon,$$

which ends the proof.
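The estimate in the proof can be observed numerically. Since the Stone-Weierstrass Theorem gives no explicit $\alpha_i$, $\beta_i$, the sketch below swaps in a truncated singular value decomposition of the sampled kernel as a convenient way to obtain separable approximations; this is an assumption made for illustration and not the construction used in the text. The kernel, the weight function `h`, and the helper `degenerate` are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 300
t = (np.arange(n) + 0.5) / n
T, S = np.meshgrid(t, t, indexing="ij")
K = sigmoid(6.0 * T * S - 2.0 * T + np.sin(3.0 * S))   # continuous, nonsymmetric kernel (hypothetical)
h = np.sign(t - 0.4)                                    # discontinuous weight function with ||h|| = 1

U, sv, Vt = np.linalg.svd(K)

def degenerate(rank):
    """Rank-`rank` separable approximation sum_i alpha_i(t) beta_i(s), sampled on the grid."""
    return (U[:, :rank] * sv[:rank]) @ Vt[:rank]

g = K @ h / n                                           # midpoint-rule version of int K h ds
errs = {}
for rank in (1, 2, 4, 8):
    A_n = degenerate(rank)
    kernel_err = np.max(np.abs(K - A_n))                # sup-norm distance between kernels
    output_err = np.max(np.abs(g - A_n @ h / n))        # sup-norm distance between outputs
    errs[rank] = (kernel_err, output_err)
    print(rank, kernel_err, output_err)
```

In every row the output error stays below the kernel error times $\|h\|$, which is exactly the bound obtained from Cauchy's inequality.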











10.8 Examples of Degenerate Kernels 


Since the functions $\alpha_i$ and $\beta_i$ are provided by the Stone-Weierstrass Theorem, which is an existential result, an explicit form for the degenerate kernel, $A_n(t,s)$, is not known in general. If more restrictive conditions are required for the kernel, explicit formulas for the degenerate kernels can be constructed. We shall deal with this problem in the following.

The extra conditions imposed on $K$ are that it be symmetric and nonnegative definite. The symmetry is needed for the existence of real eigenvalues. Recall that a continuous, symmetric integral kernel $K$ is called nonnegative definite if all its eigenvalues are nonnegative. Equivalently, this can be represented by the integral condition

$$\int_0^1 \int_0^1 K(t,s) u(t) u(s)\, dt\, ds \ge 0, \quad \forall u \in C[0,1],$$

which can be interpreted as a nonnegative integral quadratic form.¹

The following characterization theorem of nonnegative definite kernels can be found in Mercer [83] and shall be used shortly:


Theorem 10.8.1 (Mercer) Let $K$ be a continuous, symmetric and nonnegative definite integral kernel on $[0,1]$. Denote by $\{e_n\}_{n \ge 1}$ an orthonormal basis for the space spanned by the eigenfunctions corresponding to the nonzero eigenvalues $\lambda_n$. Then the kernel can be expanded as a series

$$K(t,s) = \sum_{n \ge 1} \lambda_n e_n(t) e_n(s), \quad \forall s, t \in [0,1],$$

which converges absolutely, uniformly and in mean square.

This means that we may consider degenerate kernels of the type

$$A_n(t,s) = \sum_{i=1}^{n} \lambda_i e_i(t) e_i(s),$$

which converge uniformly to the kernel $K(t,s)$, by the previous result.

In this case $g_n(t)$ can be written as a linear combination of eigenfunctions

$$g_n(t) = \int_0^1 A_n(t,s) h(s)\, ds = \sum_{i=1}^{n} \lambda_i h_i e_i(t), \qquad (10.8.5)$$

with Fourier coefficients $h_i = \int_0^1 e_i(s) h(s)\, ds$. Let

$$h_n(t) = \sum_{i=1}^{n} h_i e_i(t)$$

be the projection of $h(t)$ on the space spanned by $\{e_1(t), \dots, e_n(t)\}$. Since $e_i(t)$ are continuous, the function $h_n(t)$ is continuous; these functions will be used to approximate the function $h \in L^2[0,1]$.

¹ This definition is based on the following result proved by Hilbert [54]: $\int_0^1 \int_0^1 K(t,s) u(t) u(s)\, dt\, ds = \sum_n \lambda_n (u, \varphi_n)^2$, where $(u, \varphi_n) = \int_0^1 u(s) \varphi_n(s)\, ds$ and $\lambda_n$ is the eigenvalue corresponding to the eigenfunction $\varphi_n$.

In order to simplify notation, we employ the integral operator notations

$$\mathcal{K}(h)(t) = \int_0^1 K(t,s) h(s)\, ds, \qquad \mathcal{A}_n(h)(t) = \int_0^1 A_n(t,s) h(s)\, ds.$$

From the expression of $g_n$ given by (10.8.5), which uses only coefficients $h_i$ with $i \le n$, we obtain

$$\int_0^1 A_n(t,s) h(s)\, ds = \int_0^1 A_n(t,s) h_n(s)\, ds,$$

or, equivalently, $\mathcal{A}_n(h) = \mathcal{A}_n(h_n)$.

Using that $e_j(t)$ are orthonormal, we obtain

$$\mathcal{K}(h_n)(t) - \mathcal{A}_n(h_n)(t) = \int_0^1 [K(t,s) - A_n(t,s)] h_n(s)\, ds = \sum_{i > n} \sum_{j \le n} \lambda_i h_j e_i(t) \int_0^1 e_i(s) e_j(s)\, ds = 0,$$

since $i$ and $j$ are never equal. Hence $\mathcal{K}(h_n) = \mathcal{A}_n(h_n)$. Now, we put all parts together. The triangle inequality, together with Proposition 10.7.2 and the previous two relations, yields

$$|\mathcal{K}(h) - \mathcal{K}(h_n)| \le \underbrace{|\mathcal{K}(h) - \mathcal{A}_n(h)|}_{< \epsilon} + \underbrace{|\mathcal{A}_n(h) - \mathcal{A}_n(h_n)|}_{= 0} + \underbrace{|\mathcal{A}_n(h_n) - \mathcal{K}(h_n)|}_{= 0}.$$

This means that for any $\epsilon > 0$, there is $n \ge 1$ such that

$$\max_{t \in [0,1]} |\mathcal{K}(h)(t) - \mathcal{K}(h_n)(t)| < \epsilon.$$

The interpretation of this result follows. From part 3 of Remark 10.6.6, there is a solution $h \in L^2[0,1]$ of the equation $\mathcal{K}(h)(t) = g(t)$ if and only if the function $g$ satisfies a restrictive condition relating its Fourier coefficients to the eigenvalues $\lambda_n$. The network can then learn the outcome $g(t)$ through the continuous outcomes $\mathcal{K}(h_n)$, produced by using continuous weight functions $h_n(t)$.
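Mercer's expansion can be seen at work on a kernel whose eigenpairs are known in closed form. The sketch below uses $K(t,s) = \min(t,s)$, which is continuous, symmetric and nonnegative definite on $[0,1]^2$, with eigenvalues $\lambda_i = ((i - 1/2)\pi)^{-2}$ and eigenfunctions $e_i(t) = \sqrt{2}\sin((i - 1/2)\pi t)$. This kernel is not discussed in the text; it is chosen here only because its truncations $A_n$ are explicit, and the helper name `mercer_truncation` is hypothetical.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 201)
T, S = np.meshgrid(t, t, indexing="ij")
K = np.minimum(T, S)                          # illustrative kernel K(t,s) = min(t,s)

def mercer_truncation(n_terms):
    """A_n(t,s) = sum_{i<=n} lam_i e_i(t) e_i(s) for K(t,s) = min(t,s), sampled on the grid."""
    i = np.arange(1, n_terms + 1)
    freq = (i - 0.5) * np.pi
    lam = 1.0 / freq**2                                   # eigenvalues lam_i
    e = np.sqrt(2.0) * np.sin(np.outer(t, freq))          # e[:, i-1] holds e_i on the grid
    return (e * lam) @ e.T

# Sup-norm distance between K and its degenerate truncations
errors = {n: np.max(np.abs(K - mercer_truncation(n))) for n in (5, 50, 500)}
print(errors)
```

The sup-norm error decreases as more terms are kept, in line with the uniform convergence stated in Theorem 10.8.1.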










Figure 10.9: A four-layer neural net with a continuum of neurons in each hidden layer; the output is $y(t) = \int_0^1 \int_0^1 K_1(t,s) K_2(s,u) h(u)\, du\, ds$.


10.9 Deep Networks 

Until now we have studied only the case of a network with one hidden layer having a continuum of neurons. Using iterated kernels, in this section we consider multiple hidden layers of the same type. For instance, Fig. 10.9 represents a two-hidden-layer neural network of this form. The weights and biases for the first hidden layer are encapsulated in the kernel $K_1(t,s)$, while the weights and biases for the second hidden layer are represented by the kernel $K_2(s,u)$. The weight function from the second hidden layer to the output is denoted by $h(u)$. The continuous parameters $t, s, u$ represent the input and the parameters for the second and third layers of the network. The output of the network in Fig. 10.9 is given by a double integral

$$y(t) = \int_0^1 \int_0^1 K_1(t,s) K_2(s,u) h(u)\, du\, ds.$$

If we let

$$G(t,u) = \int_0^1 K_1(t,s) K_2(s,u)\, ds,$$

then the output can be represented equivalently as

$$y(t) = \int_0^1 G(t,u) h(u)\, du,$$

which looks like the output (10.6.2) but with a more complicated structure of the kernel $K$. If the kernels $K_1$ and $K_2$ are symmetric, then the kernel $G$ is not necessarily symmetric. However, if we further assume that $K_1 = K_2 = K$, then the iterated kernel

$$K^{(2)}(t,u) = \int_0^1 K(t,s) K(s,u)\, ds$$

is symmetric. Hence, for preserving the symmetry of the network kernel, the layers have to share the same kernel, $K$. The same procedure can be applied to a network with any number of layers.

What is the novelty brought by these types of deep networks? We noticed that in the case of a one-hidden-layer neural net, the expansion of the kernel as

$$K(t,s) = \sum_{n \ge 1} \lambda_n e_n(t) e_n(s)$$

is not valid in general. Mercer's Theorem fixed this problem by imposing some restrictive assumptions on the kernel $K$ regarding its definiteness. The gain of deeper networks is that the iterated kernel $K^{(2)}$ satisfies a similar expansion, provided the initial kernel $K$ is only symmetric. The expansion is

$$K^{(2)}(t,u) = \sum_{n \ge 1} \lambda_n^2 e_n(t) e_n(u).$$

Similar expansions hold for iterated kernels associated with deeper networks. If we define the $n$th iteration recursively as

$$K^{(n)}(t,u) = \int_0^1 K^{(n-1)}(t,s) K(s,u)\, ds,$$

then the expansion

$$K^{(n)}(t,u) = \sum_{i \ge 1} \lambda_i^n e_i(t) e_i(u)$$

converges absolutely and uniformly. The proof of this result can be found in Courant and Hilbert [29], p.138.

Therefore, all results of Section 10.8, which are proved using Mercer's Theorem under the assumption that $K$ is nonnegative definite, are valid in the general case of a deep neural net with at least two hidden layers and the same kernel in each layer.
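Both claims about composed kernels, the possible loss of symmetry for $K_1 \neq K_2$ and the squared eigenvalues of the iterated kernel $K^{(2)}$, can be checked on a grid. The following sketch uses the midpoint rule; the two symmetric kernels `K1` and `K2` and the helper `compose` are hypothetical choices made for illustration.

```python
import numpy as np

n = 200
t = (np.arange(n) + 0.5) / n
T, S = np.meshgrid(t, t, indexing="ij")

K1 = np.tanh(2.0 * T * S - 0.5 * (T + S))     # symmetric kernel (hypothetical)
K2 = np.cos(T - S) + T * S                    # another symmetric kernel (hypothetical)

def compose(Ka, Kb):
    """Midpoint-rule version of G(t,u) = int_0^1 Ka(t,s) Kb(s,u) ds."""
    return Ka @ Kb / n

G = compose(K1, K2)          # two different layers
K_iter = compose(K1, K1)     # shared kernel: the iterated kernel K^(2)

gap_mixed = np.max(np.abs(G - G.T))              # nonzero: symmetry is lost
gap_shared = np.max(np.abs(K_iter - K_iter.T))   # zero up to rounding

lam1 = np.linalg.eigvalsh(K1 / n)        # eigenvalues of the discretized operator of K
lam2 = np.linalg.eigvalsh(K_iter / n)    # eigenvalues of the discretized operator of K^(2)
print(gap_mixed, gap_shared)
```

All eigenvalues of the iterated kernel are squares, hence nonnegative, which is the reason the Mercer-type expansion becomes available without a definiteness assumption on $K$.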

10.10 Summary 

Sometimes a neural network can represent the target function exactly. We have discussed the case of exact learning of functions with finite support. This is equivalent to the network memorizing the given data; the network can be replaced by a look-up table, having no ability to generalize to new data.

Another exact representation result is the Kolmogorov-Arnold-Sprecher Theorem, which is related to the answer to Hilbert's famous thirteenth problem. The theorem states that a two-hidden-layer neural network can represent exactly any continuous function on the $n$-dimensional hypercube. This result was a joint effort of Arnold, Kolmogorov, and Sprecher around the 1960s. This deep mathematical result has only a theoretical value for neural networks, the theorem being just existential and not constructive.

Irie and Miyake's integral formula is another theoretical result, which states that arbitrary functions of finite energy can be represented by a three-layer network with a continuum of computational units.

The last part deals with exact learning in the case of a continuum of neurons with one and two hidden layers, and conditions for learning continuous functions are developed. In this case the activations of the type $\sigma(w_j x + b_j)$ are replaced by integral kernels $K(s,t)$. The main result shows that the network can represent exactly continuous functions whose Fourier coefficients decay fast enough relative to the eigenvalues of the kernel. Only the case of symmetric and nonnegative definite kernels can be fully treated, since in this case there are mathematical tools already developed by Hilbert and Mercer. Deep neural networks with a continuum of neurons are introduced, the case of shared kernels being the one with the nicest properties.


10.11 Exercises 

Exercise 10.11.1 (a) Show that a multi-perceptron neural network with one hidden layer and 2 computing units in the hidden layer can implement exactly the XOR function, i.e., a function satisfying $g(0,0) = 0$, $g(0,1) = 1$, $g(1,0) = 1$, and $g(1,1) = 0$.

(b) Draw the network and state the weights and biases.

Exercise 10.11.2 (a) Show that there is a multi-perceptron neural network with one hidden layer that can implement exactly the following mapping:

$(0,0,0) \to 1$, $(1,0,0) \to 0$, $(0,1,0) \to 0$, $(0,0,1) \to 0$,

$(0,1,1) \to 0$, $(1,1,0) \to 0$, $(1,0,1) \to 0$, $(1,1,1) \to 1$.

(b) Draw the network and state the weights and biases of all perceptrons in the network.

Exercise 10.11.3 Write the Irie-Miyake formula for the exponential function $\psi(x) = e^{-x^2}$.

Exercise 10.11.4 Construct a continuous function $f \in C[0,1]$ that cannot be learned exactly by a neural network with a logistic sigmoid activation function.

Exercise 10.11.5 Prove a similar result to Lemma 10.6.5 for deep networks with $n$ hidden layers.

Exercise 10.11.6 Let $K$ be a symmetric kernel with eigenvalues $\lambda_n$. Show
that the series convergent for any positive integer n. 

Exercise 10.11.7 Consider the data points $\{(1,1), (3,3), (5,2), (7,1)\}$.

(a) Find a simple function $c(x)$ which learns exactly the given data;

(b) Find $\alpha_j$ and $\theta_j$ such that $G(x) = \sum_{j=1}^{N} \alpha_j H(x - \theta_j)$ learns exactly the given data.

Exercise 10.11.8 Let $f_1 : \mathbb{R}^n \to \mathbb{R}^m$ and $f_2 : \mathbb{R}^m \to \mathbb{R}^k$ be two functions that are represented exactly by two feedforward networks with activation function $\phi$. Show that there is a neural network, having activation function $\phi$, which can represent the composed function $f_2 \circ f_1 : \mathbb{R}^n \to \mathbb{R}^k$.

Part III 
Information Processing 





Chapter 11 

Information Representation 


This chapter deals with information representation in neural networks and the description of the information content of several types of neurons and networks using the concept of sigma-algebra. The main idea is to describe the evolution of the information content through the layers of a network. The network's input is considered to be a random variable, characterized by a certain information. Consequently, all network layer activations will be random variables carrying forward some subset of the input information, described by some sigma-fields. From this point of view, neural networks can be interpreted as information processors.

The neural network's ability to generalize consists in using only that part of the input information that is useful for the task at hand. This is a procedure by which most of the input information is discarded by a selective process of extracting only the useful characteristic features. The output of the network carries some information, which is a sub-sigma-algebra of the input information and contains the useful information for the generalization. For instance, if the inputs are pictures of males and females, we assume the network has to figure out the gender of the person. Knowing that the gender is 1 bit of information, the network has to selectively decrease the information content from the several bits of initial information of each input picture to only the needed 1 bit of useful information.

This process of compressing the information through the layers of a network involves a certain content of information that is lost at each layer of the network. We define the uncompressed layers by means of minimum lost information and we shall study their properties.


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_11





Figure 11.1: Neural network as an information approximator: input, output and desired information fields.


Before reading this chapter, the reader who is not very familiar with measure and probability theory notions is advised to consult the Appendix.

11.1 Information Content 

In this chapter we shall assume that the input, output, and target of a neural network are random variables. To be precise, let $\Omega$ denote the sample space of a probability space and $\mathcal{H}$ be a sigma-field on $\Omega$. The input is a random variable, $X : \Omega \to \mathbb{R}^n$, which maps each state $\omega \in \Omega$ into $n$ real numbers $(X_1(\omega), \dots, X_n(\omega)) \in \mathbb{R}^n$. If $g$ denotes the input-output mapping of the network, then the output $Y = g(X)$ is also a random variable. The network's target variable is considered to be a real-valued random variable, $Z : \Omega \to \mathbb{R}$.

We shall introduce next three bodies of information useful for the study of neural networks using the concept of sigma-algebra. They basically specify the set of possible events which can be potentially observed either at the input of the network, or at the output, or in the target. See Fig. 11.1 for a diagram of the information flows.

The input information represents the sigma-algebra generated by the input variable $X$ and is denoted by $\mathcal{I} = \mathfrak{S}(X)$. It contains all the events that can be observed from the given data.

The output information is the sigma-algebra generated by the output variable $Y$ and is denoted by $\mathcal{E} = \mathfrak{S}(Y)$. This is the set of events that are observed at the output of the network. Since the output depends on the input in a deterministic way, namely, $Y = g(X)$ with $g$ a measurable function, Proposition D.5.1 implies the inclusion $\mathfrak{S}(Y) \subset \mathfrak{S}(X)$, or $\mathcal{E} \subset \mathcal{I}$. This means the output information is always coarser than the input information.

The target information is the sigma-algebra generated by the target variable $Z$, namely, $\mathcal{Z} = \mathfrak{S}(Z)$. In general, it is expected to have $\mathcal{Z} \subset \mathcal{E}$, since we want to learn a finer set of events from a coarser set. However, there is in general no inclusion relation between the target and the input information, but if $\mathcal{Z} \subset \mathcal{I}$, we say that $Z$ is learnable from $X$. This means that all events in $\mathcal{Z}$ are also events in $\mathcal{I}$. Equivalently, $Z$ is determined by $X$ in the sense that there exists a measurable function $f$ such that $Z = f(X)$.

Figure 11.2: a. Neuron as an information processing unit. b. Information flows in and out of a neural network.

The error information is the sigma-algebra generated by the events that can be observed in the target but not in the output information, namely, $\mathcal{R} = \mathfrak{S}(\mathcal{Z} \setminus \mathcal{E})$. Here, we assumed the inclusion $\mathcal{Z} \subset \mathcal{E}$ is satisfied. In the particular case $\mathcal{Z} = \mathcal{E}$ the error information is minimal, $\mathcal{R} = \{\emptyset, \Omega\}$.

A neuron can be seen as an information processing unit consisting of two blocks, the first collecting the input information $\mathcal{I}$, and the second one evaluating the information using a system of weights and an activation function, see Fig. 11.2 a. Similarly, a neural network is a special case of a black box in which information $\mathcal{I}$ goes in and information $\mathcal{E}$ comes out, see Fig. 11.2 b. The hope is to be able to use the output information to learn the target information. That involves tuning the network parameters to obtain an ascending sequence of sigma-algebras, $\mathcal{E}_n$, which tends to $\mathcal{Z}$. This means $\mathcal{E}_n \subset \mathcal{E}_{n+1}$ with $\bigcup_n \mathcal{E}_n = \mathcal{Z}$. The errors of this approximation are given by the events that can be observed in the target but not in the output sequence. They generate the descending sequence of sigma-algebras $\mathcal{R}_n = \mathfrak{S}(\mathcal{Z} \setminus \mathcal{E}_n)$. The coarser the $\mathcal{R}_n$, the better the learning of $\mathcal{Z}$ is.

The pattern qualities of these information types are described by some distribution measures. We shall discuss them next.

The pattern of relative frequency of occurrence of $X$ is described by the distribution measure, $\mu$, of $X$. This can be induced from a probability measure $P$ defined on the measurable space $(\Omega, \mathcal{H})$ as follows:

$$\mu(A) = P(\omega \in X^{-1}(A)) = P(\omega;\ X(\omega) \in A) = P(X \in A), \quad \forall A \in \mathcal{B}(\mathbb{R}^n),$$

i.e., $\mu(A)$ measures the chance that $X$ takes values in the measurable set $A$. Using symbolic infinitesimal notation, we shall also write the previous relation informally as $d\mu(x) = \mu(dx) = P(X \in dx)$. Since $\mu(\mathbb{R}^n) = P(X \in \mathbb{R}^n) = 1$, it follows that $\mu$ is a probability measure on the measurable space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$.


























The joint random variable of the input and target, $(X, Z) : \Omega \to \mathbb{R}^n \times \mathbb{R}$, defines the training events

$$\{\omega;\ X(\omega) \in A,\ Z(\omega) \in B\} \in \mathcal{H}$$

for any Borel sets $A \in \mathcal{B}(\mathbb{R}^n)$ and $B \in \mathcal{B}(\mathbb{R})$. The probability of each training event is given by

$$\rho(A, B) = P(X \in A, Z \in B), \quad \forall A \in \mathcal{B}(\mathbb{R}^n),\ B \in \mathcal{B}(\mathbb{R}),$$

where $\rho$ is the joint distribution measure of $(X, Z)$, which describes the training pattern and is called the training measure. This can be equivalently written as¹

$$d\rho(x, z) = P(X \in dx, Z \in dz).$$

We shall present next a few examples concerned with characterizing the events that random variables can represent.

Example 11.1.1 We can make a comparison between the learning of a neural network and the understanding process of a human mind. The mental concepts are similar to the elements $\omega$ of the sample space $\Omega$ of a probability space. The information contained in the mind corresponds to a certain $\sigma$-algebra $\mathcal{E}$ on $\Omega$. The mind is able to understand an exterior body of information $\mathcal{Z}$ if its own information is finer than $\mathcal{Z}$. If the $\sigma$-algebra $\mathcal{Z}$ is not contained in $\mathcal{E}$, then the mind can partially understand only the intersection information $\mathcal{E} \cap \mathcal{Z}$.


Example 11.1.2 As an extreme case, when $Z = c$, constant, the target information $\mathcal{Z} = \mathfrak{S}(Z) = \{\emptyset, \Omega\}$ is the smallest possible $\sigma$-field. This corresponds to the information with the least possible detail. This information is contained in any other body of information. For instance, we can think of $Z$ as the random variable describing weather prediction at the North Pole. The information generated by $Z$ in this case is minimal.


Example 11.1.3 If the target variable is an indicator function, $Z = 1_A$, with $A$ a measurable set, then the target information is given by $\mathcal{Z} = \mathfrak{S}(Z) = \{\emptyset, A, A^c, \Omega\}$. Consequently, the input information should contain $A$, i.e., $A \in \mathcal{I}$. The random variable

$$Z(\omega) = \begin{cases} 1, & \text{if } \omega \in A \\ 0, & \text{if } \omega \notin A \end{cases}$$

acts as a separator between the set $A$ and its complement $A^c$, see Fig. 11.3. Therefore, learning $Z$ is equivalent to learning the set $A$ (or the set $A^c$). As an informal example, we can think of the random variable $Z$ as classifying animals at the zoo into mammals and non-mammals.

Figure 11.3: $Z$ acts as a space separator.

¹ In the case when $\rho$ has a density, we write $d\rho(x,z) = p_{X,Z}(x,z)\, dx\, dz$, where $p_{X,Z}$ is the training density function.

Example 11.1.4 In general, if the target is a simple function $Z = \sum_{j=1}^{N} c_j 1_{A_j}$ with $\{A_j\}_j$ a partition of $\Omega$ and $c_j \in \mathbb{R}$, then $Z$ acts as a classifier, associating the number $c_j$ as the label of the set $A_j$. The target information $\mathcal{Z}$ is the $\sigma$-field generated by the sets $\{A_j\}_j$. In this case it is necessary that $A_j \in \mathcal{I}$.
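On a finite sample space the sigma-algebra generated by a random variable can be listed explicitly as all unions of its level sets, which makes the inclusions between the information fields easy to test. The sketch below is only a toy illustration; the sample space, the variables `X`, `Y`, `Z`, `W`, the map `g`, and the helper `generated_sigma_algebra` are all hypothetical.

```python
from itertools import combinations

def generated_sigma_algebra(omega, variable):
    """Sigma-algebra generated by `variable` on a finite sample space: all unions of its level sets."""
    atoms = {}
    for w in omega:
        atoms.setdefault(variable(w), set()).add(w)
    atoms = [frozenset(a) for a in atoms.values()]
    events = set()
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            events.add(frozenset().union(*combo))   # the empty combination gives the empty event
    return events

omega = range(6)                    # hypothetical sample space {0, ..., 5}
X = lambda w: w // 2                # input: level sets {0,1}, {2,3}, {4,5}
g = lambda x: int(x >= 1)           # deterministic input-output map of the "network"
Y = lambda w: g(X(w))               # output Y = g(X)
Z = lambda w: int(w >= 4)           # target: indicator of A = {4, 5}
W = lambda w: w % 2                 # a variable whose events X cannot resolve

I_info = generated_sigma_algebra(omega, X)   # input information
E_info = generated_sigma_algebra(omega, Y)   # output information
Z_info = generated_sigma_algebra(omega, Z)   # target information
W_info = generated_sigma_algebra(omega, W)

print(E_info <= I_info)   # output information is coarser than input information
print(Z_info <= I_info)   # Z is learnable from X
print(W_info <= I_info)   # W is not learnable from X
```

Here `Z` is an indicator as in Example 11.1.3, and its set $A = \{4,5\}$ is a union of level sets of `X`, which is why the inclusion of the target information in the input information holds.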


11.2 Learnable Targets

This section deals with the particular case of a network target, $Z$, which is learnable from the input, $X$, namely, when the inclusion $\mathcal{Z} \subset \mathcal{I}$ holds.

More precisely, if $Z$ is learnable from $X$, then by Proposition D.5.1 there is a measurable function $f : \mathbb{R}^n \to \mathbb{R}$ such that $Z = f(X)$, see Fig. 11.4 a. Hence, to learn $Z$ it suffices to approximate the measurable mapping $f$. Therefore, it suffices to show that a neural network behaves as a universal approximator for measurable functions $f \in \mathcal{M}^n$. This result was presented in Chapter 9, Section 9.3.4, and will be used in the proof of the main result.

Before presenting the results, we prepare the ground with two summary statements:

(i) What is given: The input random variable $X$, its distribution measure $\mu$, and the target random variable $Z$, as well as the training measure $\rho$.

(ii) What is to be learned: The function $f$, which satisfies $Z = f(X)$. In the learning process the function $f$ is approximated by a measurable function $g$, where the random variable $Y = g(X)$ is the network output.












Figure 11.4: a. The input and output variables. b. The measures $\mathbb{P}$, $\mu$, and $\nu$.


We also note that learning the function $f$ also implies learning the distribution
measure of $Z$, denoted by $\nu$, see Fig. 11.4 b. This follows from

$$\nu(B) = \mathbb{P}(Z \in B) = \mathbb{P}(f(X) \in B) = \mathbb{P}(X \in f^{-1}(B)) = \mu(f^{-1}(B)) = \mu \circ f^{-1}(B), \quad \forall B \in \mathcal{B}(\mathbb{R}),$$

and the fact that $\mu$ and $f^{-1}$ are either given or learnable. The relation with
the training measure is given by

$$\nu(B) = \mathbb{P}(Z \in B) = \mathbb{P}(X \in \mathbb{R}^n, Z \in B) = \rho(\mathbb{R}^n, B), \quad \forall B \in \mathcal{B}(\mathbb{R}),$$

which shows that the training measure determines the measure $\nu$.
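The relation $\nu = \mu \circ f^{-1}$ can be made concrete for a discrete input. The sketch below assumes a hypothetical three-point distribution `mu` and the hypothetical map `f(x) = x*x`; it accumulates the mass of each pre-image to obtain the distribution of $Z = f(X)$.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(mu, f):
    """Distribution of f(X) when X has the discrete distribution mu."""
    nu = defaultdict(Fraction)          # Fraction() is zero
    for x, p in mu.items():
        nu[f(x)] += p                   # mass of x is added to the point f(x)
    return dict(nu)

def measure(nu, B):
    """nu(B) for a finite set B of values."""
    return sum((p for z, p in nu.items() if z in B), Fraction(0))

# Hypothetical input distribution and target map.
mu = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}
f = lambda x: x * x

nu = pushforward(mu, f)
print(measure(nu, {1}))   # equals mu({-1, 1}), the mass of the pre-image of {1}
```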

The first result of this section formulates an analog of Theorem 9.3.24 in terms
of random variables. The theorem states that the target random variables
that are learnable can actually be learned in the probability sense (for the
definition see Appendix, section D.6).

Theorem 11.2.1 Consider a one-hidden layer neural network with a continuous
sigmoidal activation function (in the sense of Definition 2.2.1) for
the hidden layer and one linear neuron in the output layer. Given an
n-dimensional input random variable, $X$, and a one-dimensional learnable
target random variable, $Z$, there is a sequence of output random variables
$(Y_k)_k$ produced by the network,

$$Y_k = \sum_{j=1}^{N_k} \alpha_j \varphi(w_j^T X + \theta_j), \quad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R},$$

such that $Y_k$ converges to $Z$ in probability.

Proof: Since $Z$ is learnable, by Proposition D.5.1 there is a measurable
function $f \in \mathcal{M}^n$ such that $Z = f(X)$. From Theorem 9.3.24 there is a
sequence of functions $G_k$ of the type

$$G_k(x) = \sum_{j=1}^{N_k} \alpha_j \varphi(w_j^T x + \theta_j), \quad w_j \in \mathbb{R}^n,\ \alpha_j, \theta_j \in \mathbb{R},$$

such that $d_\mu(G_k, f) \to 0$, as $k \to \infty$, where $\mu$ denotes the distribution measure
of $X$.

Let $\epsilon > 0$, arbitrary fixed. Consider the set

$$A_k = \{x;\ |G_k(x) - f(x)| > \epsilon\} \subset \mathbb{R}^n.$$

The previous $d_\mu$-convergence can be written as $\mu(A_k) \to 0$, $k \to \infty$. Note
also that

$$\{\omega;\ X(\omega) \in A_k\} = \{\omega;\ |G_k(X(\omega)) - f(X(\omega))| > \epsilon\}.$$

Define $Y_k = G_k(X)$, see Fig. 11.5 a. Then

$$\mathbb{P}(\omega;\ |Y_k(\omega) - Z(\omega)| > \epsilon) = \mathbb{P}(\omega;\ |G_k(X(\omega)) - f(X(\omega))| > \epsilon) = \mathbb{P}(\omega;\ X(\omega) \in A_k) = \mathbb{P}(X^{-1}(A_k)) = \mu(A_k) \to 0, \quad k \to \infty.$$

Hence, $Y_k$ converges in probability to $Z$. ■
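The theorem is an existence statement and does not say how the weights are found. As a purely numerical illustration, the sketch below estimates the exceedance probability $\mathbb{P}(|Y_k - Z| > \epsilon)$ by sampling. Everything specific in it is an assumption made for the example: the target map `sin`, the standard normal input, the logistic sigmoid, and in particular the shortcut of drawing the hidden weights at random and fitting only the output weights by least squares, which is not the construction used in the proof.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def network_output(x, z, w, theta):
    """Y = sum_j alpha_j * sigmoid(w_j * x + theta_j), alpha fitted by least squares."""
    features = sigmoid(np.outer(x, w) + theta)        # shape (samples, N)
    alpha, *_ = np.linalg.lstsq(features, z, rcond=None)
    return features @ alpha

def exceed_prob(y, z, eps):
    """Empirical estimate of P(|Y - Z| > eps)."""
    return float(np.mean(np.abs(y - z) > eps))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)                 # samples of the input X (n = 1)
z = np.sin(x)                             # Z = f(X) with a hypothetical f
w_all = rng.normal(scale=2.0, size=40)    # hypothetical hidden weights
theta_all = rng.normal(scale=2.0, size=40)

results = {}
for N in (1, 5, 40):                      # nested hidden layers of growing width
    y = network_output(x, z, w_all[:N], theta_all[:N])
    results[N] = (float(np.mean((y - z) ** 2)), exceed_prob(y, z, eps=0.1))

for N, (mse, prob) in results.items():
    print(N, mse, prob)
```

Because the hidden layers are nested, the least-squares error cannot increase with the width; the printed exceedance estimates indicate how often the network output misses the target by more than `eps`.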

One possible interpretation of this theorem is the following. Assume
that the neural network is a machine that can estimate the age $Z(\omega)$ of a person
$\omega$. The input $X(\omega)$ represents $n$ features of the person $\omega$ (weight, height,
body fat index, etc.), while the random variables $Y_k(\omega)$ are estimates of the
person’s age. In the beginning the estimates given by the machine will very
likely be off, but after some practice the probability of accurately estimating
the person’s age will increase and in the long run will tend to 1.

But no matter how skillful the machine can be, there is always a small
probability of making an error. If these errors are eliminated, we obtain a
sequence that converges almost surely to the target variable, i.e., the age of
the person. Next we present this consequence.


Corollary 11.2.2 In the hypothesis of Theorem 11.2.1 there is a sequence
of random variables $(Y_j)_j$ produced by the network such that $Y_j \to Z$ almost
surely.


Proof: By Theorem 11.2.1 the network can produce a sequence of random
variables $(Y_k)_k$ such that $Y_k$ converges to $Z$ in probability. Since the estimation
error, $|Y_k - Z|$, can be made arbitrarily small with probability as close
as possible to 1, fixing an error $\epsilon_j = \frac{1}{j}$, the probability of being off by more
than $\epsilon_j$ decreases to zero as

$$\mathbb{P}(|Y_k - Z| > \epsilon_j) \to 0, \quad k \to \infty.$$

This implies the existence of an integer $j_0 \ge 1$ and of a subsequence $k_j$ such
that

$$\mathbb{P}(|Y_{k_j} - Z| > \epsilon_j) < \frac{1}{2^j}, \quad \forall j \ge j_0.$$

Taking the sum yields

$$\sum_{j \ge j_0} \mathbb{P}(|Y_{k_j} - Z| > \epsilon_j) \le \sum_{j \ge j_0} \frac{1}{2^j} \le 1.$$

Applying a variant of the Borel-Cantelli lemma, see Appendix, Proposition D.6.2,
implies that $Y_{k_j}$ converges to $Z$ almost surely. Renaming $Y_{k_j}$ by $Y_j$ and eliminating
all the other terms from the sequence, we arrive at the desired sequence
of estimations. ■

Figure 11.5: a. The estimations $Y_k$. b. The distribution measures of $Y_k$.


In the input process we notice some frequency patterns for the person’s
features (weight, height, body fat index, etc.), which can be modeled by the
distribution measure $\mu$ of $X$. We are interested in learning the distribution
of the person’s age, i.e., in learning the distribution measure $\nu$ of $Z$. The
next result shows that we are able to learn this from the patterns induced
by the estimations $Y_1$, $Y_2$, $Y_3$, etc. The distribution measure of $Y_k$ is denoted
by $\nu_k$, with $\nu_k : \mathcal{B}(\mathbb{R}) \to \mathbb{R}_+$ and $\nu_k(B) = \mathbb{P}(Y_k \in B) = \mu \circ G_k^{-1}(B)$ for all
$B \in \mathcal{B}(\mathbb{R})$, see Fig. 11.5 b. The concepts of weak convergence and convergence
in distribution, which will be used shortly, can be found in section D.6 of the
Appendix.


Theorem 11.2.3 Let $(Y_k)_k$ be the sequence of approximations provided by
Theorem 11.2.1. If $\nu_k$ and $\nu$ denote the distribution measures of the output
$Y_k$ and target $Z$, respectively, then the measures $(\nu_k)_k$ converge weakly to the
measure $\nu$.


Proof: By Theorem 11.2.1 the sequence $Y_k$ converges to $Z$ in probability.
Therefore, $Y_k$ also converges to $Z$ in distribution. This means that for any
bounded continuous function $g : \mathbb{R} \to \mathbb{R}$ we have

$$\mathbb{E}(g \circ Y_k) \to \mathbb{E}(g \circ Z), \quad k \to \infty.$$

This can be written equivalently using measures as

$$\int_{\mathbb{R}} g(x)\, d\nu_k(x) \to \int_{\mathbb{R}} g(x)\, d\nu(x), \quad k \to \infty,$$

which means that $\nu_k \to \nu$ weakly, as $k \to \infty$.


11.3 Lost Information 

We have seen that for any neural network the non-strict inclusion $\mathcal{E} \subset \mathcal{I}$
always holds, as a consequence of the fact that the output, $Y$, is determined
by the input, $X$. Therefore, there might be events potentially observed in
$\mathcal{I}$, which are not observed in $\mathcal{E}$. Roughly speaking, these events represent
the lost information in the network. This concept is formalized in the next
definition.

Definition 11.3.1 The lost information in a neural network is the sigma-algebra
generated by the sets that belong to the input information and do not
belong to the output information,

$$\mathcal{L} = \mathfrak{S}(\mathcal{I} \setminus \mathcal{E}).$$

The lost information describes the compressibility of the information in a
neural network. The uncompressed case, given by $\mathcal{I} = \mathcal{E}$, implies $\mathcal{L} = \{\emptyset, \Omega\}$,
which is the coarsest sigma-algebra. The total compression case occurs for
$\mathcal{E} = \{\emptyset, \Omega\}$, which implies $\mathcal{L} = \mathcal{I}$, namely, the lost information is maximum
(and then the network becomes useless).

Example 11.3.2 The input of a classical perceptron is given by a random
variable $X$, which takes only the values 0 and 1. Let $A = X^{-1}(\{1\})$, so
we have $X = 1_A$, i.e., $X$ is the indicator function of the set $A$. The input
information is generated by $A$ and is given by

$$\mathcal{I} = \mathfrak{S}(X) = \{\emptyset, A, A^c, \Omega\}.$$

Consider the output variable $Y = f(X)$, with $f(1) = a$ and $f(0) = b$. Then
the output information is $\mathcal{E} = \mathfrak{S}(Y)$ and its content depends on the following
two cases. If $a \ne b$, then $\mathcal{E} = \mathcal{I}$, and the lost information $\mathcal{L}$ is trivial. If $a = b$,
then $\mathcal{E} = \{\emptyset, \Omega\}$, and then $\mathcal{L} = \mathcal{I}$, i.e., in this case all information is lost.
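The two cases of this example can be checked by brute force on a finite sample space. The sketch below is an illustration under assumptions: the four-point space `omega`, the set `A`, and the helper names `generate`, `info`, and `lost_information` are hypothetical. On a finite space the generated sigma-algebra is obtained as all unions of atoms, where two points lie in the same atom when no generating set separates them.

```python
from itertools import combinations

def generate(sets, omega):
    """Sigma-algebra on the finite set omega generated by the collection `sets`."""
    sets = list(sets)
    groups = {}
    for w in omega:
        key = tuple(w in s for s in sets)       # membership signature of w
        groups.setdefault(key, set()).add(w)
    atoms = [frozenset(g) for g in groups.values()]
    return {frozenset().union(*c)
            for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)}

def info(f, omega):
    """Information field of a random variable f, generated by its level sets."""
    levels = {}
    for w in omega:
        levels.setdefault(f(w), set()).add(w)
    return generate([frozenset(s) for s in levels.values()], omega)

def lost_information(I, E, omega):
    """L = S(I \\ E): generated by input events that are not output events."""
    return generate(I - E, omega)

omega = frozenset({0, 1, 2, 3})                 # hypothetical sample space
A = frozenset({0, 1})
X = lambda w: 1 if w in A else 0                # X = indicator of A
I = info(X, omega)

trivial = {frozenset(), omega}
E_distinct = info(lambda w: {1: 5.0, 0: 7.0}[X(w)], omega)   # case a != b
E_equal = info(lambda w: {1: 5.0, 0: 5.0}[X(w)], omega)      # case a == b

print(lost_information(I, E_distinct, omega) == trivial)
print(lost_information(I, E_equal, omega) == I)
```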

The next proposition provides an equivalent characterization of the lost 
information. 

Proposition 11.3.3 Let $\mathcal{I}$, $\mathcal{E}$, and $\mathcal{L}$ denote the input, output, and lost information
of a neural network, respectively. The following hold:

(a) The input information decomposes as

$$\mathcal{I} = \mathfrak{S}(\mathcal{L} \cup \mathcal{E}) = \mathcal{L} \vee \mathcal{E}.$$

(b) $\mathcal{L}$ is the largest field which satisfies (a).





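Proof: (a) It will be shown by double inclusion. First, since $\mathcal{L} \subset \mathcal{I}$ and $\mathcal{E} \subset \mathcal{I}$,
then $\mathcal{L} \cup \mathcal{E} \subset \mathcal{I}$. Using the monotonicity of the operator $\mathfrak{S}$ yields

$$\mathfrak{S}(\mathcal{L} \cup \mathcal{E}) \subset \mathfrak{S}(\mathcal{I}) = \mathcal{I}.$$

For the inverse inclusion, we start from $\mathfrak{S}(\mathcal{I} \setminus \mathcal{E}) = \mathcal{L}$, so $\mathcal{I} \setminus \mathcal{E} \subset \mathcal{L}$. Then
taking in both sides the union with $\mathcal{E}$, we obtain

$$(\mathcal{I} \setminus \mathcal{E}) \cup \mathcal{E} \subset \mathcal{L} \cup \mathcal{E},$$

which is equivalent to $\mathcal{I} \subset \mathcal{L} \cup \mathcal{E}$. Taking the operator $\mathfrak{S}$ yields

$$\mathfrak{S}(\mathcal{I}) \subset \mathfrak{S}(\mathcal{L} \cup \mathcal{E}) = \mathcal{L} \vee \mathcal{E}.$$

Using that $\mathcal{I} = \mathfrak{S}(\mathcal{I})$ it follows that $\mathcal{I} \subset \mathfrak{S}(\mathcal{L} \cup \mathcal{E})$.

(b) Let $\mathcal{L}'$ be another sigma-field which verifies $\mathcal{I} = \mathfrak{S}(\mathcal{L}' \cup \mathcal{E})$. In order to show
that $\mathcal{L}$ is the largest field which satisfies (a), it suffices to show that $\mathcal{L}' \subset \mathcal{L}$,
i.e., $\mathcal{L}' \subset \mathfrak{S}(\mathcal{I} \setminus \mathcal{E})$. The right side can be written as

$$\mathfrak{S}(\mathcal{I} \setminus \mathcal{E}) = \mathfrak{S}(\mathfrak{S}(\mathcal{L}' \cup \mathcal{E}) \setminus \mathcal{E}).$$

Therefore, it suffices to show the inclusion

$$\mathcal{L}' \subset \mathfrak{S}(\mathfrak{S}(\mathcal{L}' \cup \mathcal{E}) \setminus \mathcal{E}).$$

This holds as a strict inclusion since the information $\mathcal{L}'$ is still contained in
the right side, but after extracting $\mathcal{E}$ there might be some sets left that do
not belong to $\mathcal{L}'$.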


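To see how the previous result works on a concrete case, we consider the
following example.

Example 11.3.1 Let $A, B \in \mathcal{I}$ be two sets and consider the sigma-fields

$$\mathcal{L}' = \{\emptyset, A, A^c, \Omega\}, \quad \mathcal{E} = \{\emptyset, B, B^c, \Omega\}.$$

Then $\mathcal{L}' \cup \mathcal{E} = \{\emptyset, A, B, A^c, B^c, \Omega\}$, while $\mathfrak{S}(\mathcal{L}' \cup \mathcal{E})$ is the sigma-field generated
by $A$ and $B$ and has 16 elements, see Exercise 11.10.3. Therefore,
taking away the elements of $\mathcal{E}$, it follows that $\mathfrak{S}(\mathcal{L}' \cup \mathcal{E}) \setminus \mathcal{E}$ will have only
12 elements. Among these elements there are also the sets $A$ and $A^c$. Then
$\mathfrak{S}(\mathfrak{S}(\mathcal{L}' \cup \mathcal{E}) \setminus \mathcal{E})$ will contain, besides $A$, $A^c$, and others, also $\emptyset$ and $\Omega$. Hence,
$\mathcal{L}' \subset \mathfrak{S}(\mathfrak{S}(\mathcal{L}' \cup \mathcal{E}) \setminus \mathcal{E})$, with a strict inclusion, since sets such as $A \cap B$ do not
belong to $\mathcal{L}'$ but belong to the sigma-field on the right side.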
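The element counts in this example can be verified on a small model. The sketch below assumes a hypothetical four-point sample space in which $A$ and $B$ are in general position (all four intersections of $A$, $A^c$ with $B$, $B^c$ are nonempty); the helper `generate` is an illustrative name and builds the generated sigma-algebra as all unions of atoms.

```python
from itertools import combinations

def generate(sets, omega):
    """Sigma-algebra on the finite set omega generated by the collection `sets`."""
    sets = list(sets)
    groups = {}
    for w in omega:
        groups.setdefault(tuple(w in s for s in sets), set()).add(w)
    atoms = [frozenset(g) for g in groups.values()]
    return {frozenset().union(*c)
            for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)}

omega = frozenset({0, 1, 2, 3})          # hypothetical sample space
A = frozenset({0, 1})
B = frozenset({0, 2})

L_prime = generate([A], omega)           # {empty, A, A^c, omega}
E = generate([B], omega)                 # {empty, B, B^c, omega}

joined = generate(L_prime | E, omega)    # sigma-field generated by A and B
remaining = joined - E                   # events left after extracting E
regenerated = generate(remaining, omega)

print(len(L_prime | E), len(joined), len(remaining), len(regenerated))
```

In this model the regenerated field coincides with the field generated by $A$ and $B$, so it strictly contains $\mathcal{L}'$; for instance it contains $A \cap B$.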

The relation with the target information is provided next. 


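Proposition 11.3.4 Assume the target $Z$ of a neural network is learnable
from the input $X$. Then

$$\mathcal{R} \subset \mathcal{L}, \quad (11.3.1)$$

where $\mathcal{R}$ is the error information and $\mathcal{L}$ is the lost information of the network.

Proof: Since $Z$ is learnable from $X$, we have $\mathcal{Z} \subset \mathcal{I}$. Extracting the set $\mathcal{E}$
from both sides of the previous inclusion, we get $\mathcal{Z} \setminus \mathcal{E} \subset \mathcal{I} \setminus \mathcal{E}$. Using the
monotonicity of the operator $\mathfrak{S}$ we obtain

$$\mathfrak{S}(\mathcal{Z} \setminus \mathcal{E}) \subset \mathfrak{S}(\mathcal{I} \setminus \mathcal{E}),$$

which is equivalent to relation (11.3.1). ■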


11.4 Recoverable Information 

The concept of information recovery is related to the following question: how much
information can be excluded from a sigma-field such that it can be recovered from the
remaining information by unions, intersections, and complement operations?

The main result of this section states that the lost information of a neural
network is non-recoverable. Therefore, in a sense, lost and recovered information
have opposite connotations.

Definition 11.4.1 (a) A sigma-algebra $\mathcal{F}$ in $\mathcal{I}$ is called recoverable if $\mathfrak{S}(\mathcal{I} \setminus \mathcal{F}) = \mathcal{I}$.

(b) A sigma-algebra $\mathcal{H}$ in $\mathcal{I}$ is called non-recoverable if $\mathfrak{S}(\mathcal{I} \setminus \mathcal{H}) \ne \mathcal{I}$.

Two sigma-fields $\mathcal{A}$ and $\mathcal{B}$ are comparable if either $\mathcal{A} \subset \mathcal{B}$ or $\mathcal{B} \subset \mathcal{A}$. An
ascending sequence of sigma-fields ordered by inclusion is called a filtration. For
more details on filtrations, see the Appendix. Also, note that if any ascending
sequence of sigma-fields is included in a given sigma-field $\mathcal{I}$, then by Zorn’s lemma (see
Lemma A.0.3 in the Appendix) there is a maximal element of the sequence.
This is the existence basis of the next concepts.

A sigma-algebra $\mathcal{F}$ is maximal recoverable if it satisfies $\mathfrak{S}(\mathcal{I} \setminus \mathcal{F}) = \mathcal{I}$ and
is maximal with respect to inclusion. Similarly, a sigma-algebra $\mathcal{H}$ is minimal
non-recoverable if $\mathfrak{S}(\mathcal{I} \setminus \mathcal{H}) \ne \mathcal{I}$ and it is minimal with respect to inclusion.

The following two properties are straightforward:

1. Any maximal recoverable information field, $\mathcal{F}$, is included in a minimal
non-recoverable information field, $\mathcal{H}$.

2. Any minimal non-recoverable information field, $\mathcal{H}$, contains a maximal
recoverable information field, $\mathcal{F}$.

It will be clear from the next examples that the information fields $\mathcal{F}$ and
$\mathcal{H}$ are not unique.




Example 11.4.2 Consider the sigma-algebra generated by one measurable set,
$A$, given by $\mathcal{I} = \{\emptyset, A, A^c, \Omega\}$. It is easy to show that the maximal recoverable
information is the trivial one, $\mathcal{F} = \{\emptyset, \Omega\}$. It is obvious that the minimal
non-recoverable information is the entire space, $\mathcal{H} = \mathcal{I}$.

Example 11.4.3 Let $A$, $B$ be two measurable sets in $\Omega$ and let

$$\mathcal{I} = \{\emptyset, A, B, A^c, B^c, A \cap B, A \cup B, A^c \cap B^c, A^c \cup B^c, A^c \cap B, A \cap B^c, A^c \cup B, A \cup B^c, (A \cap B^c) \cup (A^c \cap B), (A \cap B) \cup (A \cup B)^c, \Omega\}$$

be the sigma-field generated by $A$ and $B$. Then

$$\mathcal{F}_A = \{\emptyset, A, A^c, \Omega\}$$

is a maximal recoverable information. In order to show that $\mathfrak{S}(\mathcal{I} \setminus \mathcal{F}_A) = \mathcal{I}$, it
suffices to recover the set $A$ from the set $\mathcal{I} \setminus \mathcal{F}_A$ by using union, intersection,
and complementarity. This can be done as in the following:

$$A = (A \setminus B) \cup (A \cap B) = (B^c \cap A) \cup (A \cap B).$$

Then $A^c$ is obtained by taking the complement and applying de Morgan’s
formulas. The sets $\emptyset$ and $\Omega$ are obviously obtained from $A$ and $A^c$ by taking
intersection and union, respectively.

It is worth noting that the maximal recoverable information fields are
not unique, but they have the same cardinality. Other maximal recoverable
information fields are $\mathcal{F}_B = \{\emptyset, B, B^c, \Omega\}$, $\mathcal{F}_C = \{\emptyset, C, C^c, \Omega\}$, $\mathcal{F}_{A \cap B} = \{\emptyset, A \cap B, (A \cap B)^c, \Omega\}$, $\mathcal{F}_{A \Delta B} = \{\emptyset, A \Delta B, (A \Delta B)^c, \Omega\}$, where $A \Delta B = (A \setminus B) \cup (B \setminus A)$ is the symmetric difference, etc.
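The recoverability condition $\mathfrak{S}(\mathcal{I} \setminus \mathcal{F}) = \mathcal{I}$ can be tested mechanically on a finite model. The sketch below assumes a hypothetical four-point sample space with $A$ and $B$ in general position; `generate` and `is_recoverable` are illustrative helper names. It checks recoverability only and makes no claim about maximality.

```python
from itertools import combinations

def generate(sets, omega):
    """Sigma-algebra on the finite set omega generated by the collection `sets`."""
    sets = list(sets)
    groups = {}
    for w in omega:
        groups.setdefault(tuple(w in s for s in sets), set()).add(w)
    atoms = [frozenset(g) for g in groups.values()]
    return {frozenset().union(*c)
            for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)}

def is_recoverable(F, I, omega):
    """True when the events of I outside F regenerate all of I."""
    return generate(I - F, omega) == I

omega = frozenset({0, 1, 2, 3})          # hypothetical sample space
A = frozenset({0, 1})
B = frozenset({0, 2})
I = generate([A, B], omega)              # 16 events

candidates = {
    "F_A": generate([A], omega),
    "F_B": generate([B], omega),
    "F_A_and_B": generate([A & B], omega),
    "F_A_sym_B": generate([A ^ B], omega),   # symmetric difference
    "I itself": I,
}
for name, F in candidates.items():
    print(name, is_recoverable(F, I, omega))
```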

Example 11.4.4 Let $\{E_j\}_{j=1}^{N}$ be a finite partition of the sample space $\Omega$.
It is known that the sigma-algebra generated by the partition consists of unions
of elements of the partition

$$\mathcal{I} = \{\emptyset, E_i, E_i \cup E_j, E_i \cup E_j \cup E_k, \dots, \bigcup_{i=1}^{N} E_i = \Omega\}.$$

Pick an element of the partition, say $E_1$, and consider the sigma-algebra generated
by it

$$\mathcal{F}_{E_1} = \{\emptyset, E_1, \bigcup_{i=2}^{N} E_i, \Omega\}.$$

Reasoning similar to the previous example shows that $\mathcal{F}_{E_1}$ is a recoverable
information. For maximality, see Exercise 11.10.5.




The rest of the section deals with two results regarding the lost information. 

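Proposition 11.4.5 Let $\mathcal{L}$ denote the lost information. We have:

(a) The output information, $\mathcal{E}$, is included in a maximal recoverable information
field if and only if the lost information is maximal, $\mathcal{I} = \mathcal{L}$.

(b) If the output information, $\mathcal{E}$, includes a minimal non-recoverable information
field, then $\mathcal{L} \ne \mathcal{I}$.

Proof: (a) “$\Longrightarrow$” Let $\mathcal{F}$ be a maximal recoverable information field and
assume $\mathcal{E} \subset \mathcal{F}$. Then $\mathcal{I} \setminus \mathcal{F} \subset \mathcal{I} \setminus \mathcal{E}$, and using the monotonicity of the operator
$\mathfrak{S}$ we get

$$\mathfrak{S}(\mathcal{I} \setminus \mathcal{F}) \subset \mathfrak{S}(\mathcal{I} \setminus \mathcal{E}).$$

Since $\mathcal{F}$ is recoverable, $\mathfrak{S}(\mathcal{I} \setminus \mathcal{F}) = \mathcal{I}$, and using that the right side is the lost
information, we get $\mathcal{I} \subset \mathcal{L}$, which in fact implies equality, $\mathcal{I} = \mathcal{L}$.

“$\Longleftarrow$” Assume $\mathcal{I} = \mathcal{L}$, so $\mathcal{I} = \mathfrak{S}(\mathcal{I} \setminus \mathcal{E})$. This means the output information $\mathcal{E}$
is recoverable. Since it can be included in a maximal recoverable field (Zorn’s
lemma), we obtain the desired result.

(b) Assume $\mathcal{H} \subset \mathcal{E}$, with $\mathcal{H}$ a minimal non-recoverable information field. Then
$\mathcal{I} \setminus \mathcal{E} \subset \mathcal{I} \setminus \mathcal{H}$. This implies $\mathfrak{S}(\mathcal{I} \setminus \mathcal{E}) \subset \mathfrak{S}(\mathcal{I} \setminus \mathcal{H})$. Therefore $\mathcal{L} \subset \mathfrak{S}(\mathcal{I} \setminus \mathcal{H}) \subsetneq \mathcal{I}$,
and hence $\mathcal{L} \ne \mathcal{I}$.

■

We show next that in general the lost information is non-recoverable, i.e.,
$\mathfrak{S}(\mathcal{I} \setminus \mathcal{L}) \ne \mathcal{I}$.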

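Theorem 11.4.6 (Recoverable lost information) If the lost information
is recoverable, then it is trivial, $\mathcal{L} = \{\emptyset, \Omega\}$.

Proof: By contradiction, assume the lost information $\mathcal{L}$ is recoverable, namely,

$$\mathfrak{S}(\mathcal{I} \setminus \mathcal{L}) = \mathcal{I}. \quad (11.4.2)$$

Taking the complement in the obvious inclusion $\mathcal{L} = \mathfrak{S}(\mathcal{I} \setminus \mathcal{E}) \supset \mathcal{I} \setminus \mathcal{E}$ implies

$$\mathcal{I} \setminus \mathcal{L} \subset \mathcal{I} \setminus (\mathcal{I} \setminus \mathcal{E}) = \mathcal{E}.$$

Then applying again the operator $\mathfrak{S}$ we obtain

$$\mathcal{I} = \mathfrak{S}(\mathcal{I} \setminus \mathcal{L}) \subset \mathfrak{S}(\mathcal{E}) = \mathcal{E},$$

where the first equality uses (11.4.2) and the last one uses that $\mathcal{E}$ is a sigma-field.
By transitivity, $\mathcal{I} \subset \mathcal{E}$, which is in fact identity, $\mathcal{I} = \mathcal{E}$. This implies that the
lost information $\mathcal{L} = \mathfrak{S}(\mathcal{I} \setminus \mathcal{E}) = \{\emptyset, \Omega\}$ is trivial.

■

The previous result can also be restated equivalently by saying that no
nontrivial information can be recovered from the lost information.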








11.5 Information Representation Examples 

This section deals with characterizing the events that a few types of neurons 
and some simple networks can represent. These examples suggest that the 
more complex the input-output mapping is, the richer its associated sigma- 
algebra tends to be. 


11.5.1 Information for indicator functions 

The simplest nontrivial case of sigma-algebra is represented by indicator functions
and their combinations. The following examples will be useful in later
sections.


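Example 11.5.1 Let $A \subset \Omega$ be a measurable set and consider the indicator
function $X = 1_A$. Since

$$X(\omega) = \begin{cases} 1, & \text{if } \omega \in A \\ 0, & \text{if } \omega \in A^c, \end{cases}$$

then for any $k \in \mathbb{R}$ we obtain

$$X^{-1}(-\infty, k] = \{\omega;\ X(\omega) \le k\} = \begin{cases} \Omega, & \text{if } k \ge 1 \\ A^c, & \text{if } 0 \le k < 1 \\ \emptyset, & \text{if } k < 0. \end{cases}$$

The sigma-algebra generated by $X$ is generated by the pre-images $\{X^{-1}(-\infty, k],\ k \in \mathbb{R}\}$
and is given by $\mathfrak{S}(X) = \{\emptyset, A, A^c, \Omega\}$.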


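Example 11.5.2 Let $A, B \subset \Omega$ be measurable and disjoint, $A \cap B = \emptyset$.
Consider the linear combination of indicators $X = 1_A + 2 \cdot 1_B$. Then

$$X(\omega) = \begin{cases} 2, & \text{if } \omega \in B \\ 1, & \text{if } \omega \in A \\ 0, & \text{if } \omega \notin A \cup B. \end{cases}$$

For any real $k$, the half-line pre-images through $X$ are given by

$$X^{-1}(-\infty, k] = \{\omega;\ X(\omega) \le k\} = \begin{cases} \Omega, & \text{if } k \ge 2 \\ B^c, & \text{if } 1 \le k < 2 \\ (A \cup B)^c, & \text{if } 0 \le k < 1 \\ \emptyset, & \text{if } k < 0. \end{cases}$$

The sigma-algebra generated by the pre-images $\{X^{-1}(-\infty, k],\ k \in \mathbb{R}\}$ provides
the information field generated by $X$:

$$\mathfrak{S}(X) = \{\emptyset, A, A^c, B, B^c, A \cup B, (A \cup B)^c, \Omega\}.$$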
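The half-line pre-images of this example can be computed explicitly on a finite model and then closed under complements and unions. The sketch below assumes a hypothetical five-point sample space with disjoint sets `A` and `B`; the brute-force helper `close` is an illustrative name and is only practical for very small spaces.

```python
def close(sets, omega):
    """Smallest family containing `sets` that is closed under complement and union."""
    omega = frozenset(omega)
    events = {frozenset(), omega} | {frozenset(s) for s in sets}
    changed = True
    while changed:
        changed = False
        for a in list(events):
            for b in list(events):
                for c in (omega - a, a | b):
                    if c not in events:
                        events.add(c)
                        changed = True
    return events

omega = range(5)                          # hypothetical sample space
A, B = {0, 1}, {2}                        # disjoint sets
X = lambda w: (1 if w in A else 0) + 2 * (1 if w in B else 0)

# Pre-images X^{-1}(-inf, k] for one threshold in each regime of k.
preimages = [frozenset(w for w in omega if X(w) <= k) for k in (-1, 0, 1, 2)]
info_X = close(preimages, omega)

print(sorted(sorted(e) for e in info_X))
```

The closure contains exactly the eight events listed above: the unions of the three atoms $A$, $B$, and $(A \cup B)^c$.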






Figure 11.6: Diagram for Example 11.5.3. 


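Example 11.5.3 This is similar to Example 11.5.2, but under the hypothesis
that the sets $A$ and $B$ are not disjoint. In this case we have

$$X(\omega) = 1_A(\omega) + 2 \cdot 1_B(\omega) = \begin{cases} 3, & \text{if } \omega \in A \cap B \\ 2, & \text{if } \omega \in B \cap A^c \\ 1, & \text{if } \omega \in A \cap B^c \\ 0, & \text{if } \omega \in (A \cup B)^c, \end{cases}$$

see Fig. 11.6. Then for any real $k$

$$X^{-1}(-\infty, k] = \begin{cases} \Omega, & \text{if } k \ge 3 \\ (A \cap B)^c, & \text{if } 2 \le k < 3 \\ B^c, & \text{if } 1 \le k < 2 \\ (A \cup B)^c, & \text{if } 0 \le k < 1 \\ \emptyset, & \text{if } k < 0. \end{cases}$$

Since $(A \cup B)^c, (A \cap B)^c \in \mathfrak{S}(X)$, then $A \cup B, A \cap B \in \mathfrak{S}(X)$. Using the
properties of intersection and union, we show next that $A$ belongs to $\mathfrak{S}(X)$:

$$A = A \cap \Omega = A \cap (B^c \cup B) = (A \cap B^c) \cup (A \cap B) = \big(\underbrace{(A \cup B) \cap B^c}_{\in \mathfrak{S}(X)}\big) \cup \underbrace{(A \cap B)}_{\in \mathfrak{S}(X)} \in \mathfrak{S}(X).$$

Since both $A$ and $B$ belong to $\mathfrak{S}(X)$, it follows that $\mathfrak{S}(1_A, 1_B) \subset \mathfrak{S}(X)$.
Since $X(\omega) = 1_A(\omega) + 2 \cdot 1_B(\omega)$ is $\mathfrak{S}(1_A, 1_B)$-measurable, then $\mathfrak{S}(X) \subset \mathfrak{S}(1_A, 1_B)$,
from which we conclude $\mathfrak{S}(X) = \mathfrak{S}(1_A, 1_B)$.

We note that a similar analysis can be applied to the linear combination
$X = \alpha 1_A + \beta 1_B$, with $\alpha \ne \beta$.


Example 11.5.4 Let $\{E_i\}_i$ be a countable partition of $\Omega$ and consider $X = \sum_i a_i 1_{E_i}$,
with distinct coefficients $a_i \in \mathbb{R}$. It can be shown that $\mathfrak{S}(X)$ in
this case is the set of countable unions of elements of the partition.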












Figure 11.7: a. Information generated by a perceptron. b. Information generated
by a sigmoid or a linear neuron.

11.5.2 Classical perceptron 

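In this case the input variable is $X : \Omega \to \mathbb{R}^n$, while the output is provided
by $Y = H(w^T X + \theta)$, where $H$ is the Heaviside step function, $w \in \mathbb{R}^n$ is
a fixed system of weights, and $\theta \in \mathbb{R}$ is a given threshold. Since

$$Y(\omega) = \begin{cases} 1, & \text{if } w^T X(\omega) + \theta \ge 0 \\ 0, & \text{if } w^T X(\omega) + \theta < 0 \end{cases} = \begin{cases} 1, & \text{if } X(\omega) \in \mathcal{H}^+_{w,\theta} \\ 0, & \text{if } X(\omega) \in \mathcal{H}^-_{w,\theta}, \end{cases}$$

then the output can be written as the indicator function $Y = 1_A$, with
$A = X^{-1}(\mathcal{H}^+_{w,\theta})$, where $\mathcal{H}^+_{w,\theta} = \{x;\ w^T x + \theta \ge 0\}$ denotes a closed upper
half-space in $\mathbb{R}^n$ with normal direction $w$.

Consequently, using Example 11.5.1, the sigma-algebra generated by the output
$Y$ is

$$\mathfrak{S}(Y) = \{\emptyset, A, A^c, \Omega\} = \{\emptyset, X^{-1}(\mathcal{H}^+_{w,\theta}), X^{-1}(\mathcal{H}^-_{w,\theta}), \Omega\},$$

where we used that

$$A^c = \big(X^{-1}(\mathcal{H}^+_{w,\theta})\big)^c = X^{-1}\big((\mathcal{H}^+_{w,\theta})^c\big) = X^{-1}(\mathcal{H}^-_{w,\theta}),$$

where

$$\mathcal{H}^-_{w,\theta} = \{x;\ w^T x + \theta < 0\}.$$

This can be further written as

$$\mathfrak{S}(Y) = X^{-1}\big(\{\emptyset, \mathcal{H}^+_{w,\theta}, \mathcal{H}^-_{w,\theta}, \mathbb{R}^n\}\big).$$

It is worth noting that the sigma-algebra $\{\emptyset, \mathcal{H}^+_{w,\theta}, \mathcal{H}^-_{w,\theta}, \mathbb{R}^n\}$ provides the information
which classifies points above and below the hyperplane $w^T x + \theta = 0$.
This is the reason why a classical perceptron can be used to classify two
clusters, but no more than that, given the limited information it can accommodate,
see Fig. 11.7 a.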













11.5.3 Linear neuron 

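Consider the input $X : \Omega \to \mathbb{R}^n$ and the output $Y = w^T X + \theta$, where $w \in \mathbb{R}^n$
and $\theta \in \mathbb{R}$ are given. We are interested in finding the information field $\mathfrak{S}(Y)$.
Since for any real number $k$ we have

$$Y^{-1}(-\infty, k) = \{\omega;\ Y(\omega) < k\} = \{\omega;\ w^T X(\omega) + \theta < k\} = X^{-1}(\mathcal{H}^-_{w,\theta-k}),$$

then

$$\begin{aligned} \mathfrak{S}(Y) &= \mathfrak{S}\big(Y^{-1}(-\infty, k);\ k \in \mathbb{R}\big) = \mathfrak{S}\big(X^{-1}(\mathcal{H}^-_{w,\theta-k});\ k \in \mathbb{R}\big) \\ &= X^{-1}\big(\mathfrak{S}(\mathcal{H}^-_{w,\theta-k};\ k \in \mathbb{R})\big) = X^{-1}\big(\mathfrak{S}(\mathcal{H}^-_{w,u};\ u \in \mathbb{R})\big) \\ &= X^{-1}\big(\mathfrak{S}(\mathcal{H}^+_{w,u};\ u \in \mathbb{R})\big). \end{aligned}$$

The commutation between $\mathfrak{S}$ and $X^{-1}$ follows from Proposition A.0.1 of
the Appendix. Since $u$ is a parameter, the hyperplanes $\{w^T x + u = 0\}$ are
parallel. Hence, the upper half-spaces $\mathcal{H}^+_{w,u}$ are obtained by translating $\mathcal{H}^+_{w,0}$
a distance $u$ along direction $w$, see Fig. 11.7 b. The field $\mathfrak{S}(\mathcal{H}^+_{w,u};\ u \in \mathbb{R})$
accommodates a lot more information than the one corresponding to the
classical perceptron described in section 11.5.2. Containing infinitely many
parallel strips, it defines the information of a field of parallel hyperplanes in
the space.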
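The contrast between the classical perceptron of section 11.5.2 and the linear neuron can be seen on a finite sample by counting the level sets of the output, which are the atoms of the information it generates on that sample. The sketch below is an illustration under assumptions: the weights `w`, the threshold `theta`, and the Gaussian input sample are hypothetical.

```python
import numpy as np

def level_sets(values):
    """Group sample indices by output value; these are the atoms on the sample."""
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(float(v), []).append(i)
    return list(groups.values())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))             # hypothetical input sample, n = 2
w = np.array([1.0, -2.0])                 # hypothetical weights
theta = 0.5                               # hypothetical threshold

s = X @ w + theta
Y_perceptron = (s >= 0).astype(float)     # Heaviside step, H(0) = 1
Y_linear = s                              # linear neuron

print(len(level_sets(Y_perceptron)))      # two atoms: the two half-spaces
print(len(level_sets(Y_linear)))          # many atoms: one per hyperplane hit
```

The perceptron separates the sample into the two pre-images of the half-spaces only, while the linear neuron distinguishes every parallel hyperplane that contains a sample point.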

11.5.4 Sigmoid neuron 

The input is given by the random variable $X : \Omega \to \mathbb{R}^n$ and the output by
$Y = \sigma(w^T X + \theta)$, where $\sigma$ is the logistic sigmoid activation function, and
$w \in \mathbb{R}^n$, $\theta \in \mathbb{R}$ are fixed. Using the monotonicity and invertibility of the
logistic function, for any $k \in (0,1)$, we have

$$\begin{aligned} Y^{-1}(-\infty, k) &= \{\omega;\ Y(\omega) < k\} = \{\omega;\ \sigma(w^T X(\omega) + \theta) < k\} \\ &= \{\omega;\ w^T X(\omega) + \theta < \sigma^{-1}(k)\} \\ &= \{\omega;\ X(\omega) \in \mathcal{H}^-_{w,\theta-\sigma^{-1}(k)}\}, \end{aligned}$$

with $u = \sigma^{-1}(k)$ a real parameter. The sigma-field generated by $Y$ is

$$\begin{aligned} \mathfrak{S}(Y) &= \mathfrak{S}\big(Y^{-1}(-\infty, k);\ k \in [0,1]\big) = \mathfrak{S}\big(X^{-1}(\mathcal{H}^-_{w,u});\ u \in \mathbb{R}\big) \\ &= X^{-1}\big(\mathfrak{S}(\mathcal{H}^-_{w,u};\ u \in \mathbb{R})\big) = X^{-1}\big(\mathfrak{S}(\mathcal{H}^+_{w,u};\ u \in \mathbb{R})\big), \end{aligned}$$

where we used Proposition A.0.1. We have obtained the same sigma-field as in
the case of the linear neuron given in section 11.5.3, so all conclusions mentioned
there also apply here. Since this field is independent of the activation
function $\sigma$, we infer that all sigmoid neurons can learn the same amount of
information as long as the activation function $\sigma$ enjoys all previously used properties.
Consequently, a sigmoidal neuron can learn a lot more than a classical
perceptron. Note that the nonlinearity of the activation function $\sigma$ does not
have any impact on the information produced. We shall see that this will
no longer hold for a network of this type of neurons, where the nonlinearity
plays an important role.

Figure 11.8: Neural network made of two perceptrons. Its output is given by
$Y = \alpha_1 H(w_1^T X + \theta_1) + \alpha_2 H(w_2^T X + \theta_2)$.

11.5.5 Arctangent neuron

This is a particular case of a sigmoid neuron, which will be used again in
section 11.5.8. The output variable is given by $Y = \tan^{-1}(w^T X + \theta)$ and
takes values in $(-\pi/2, \pi/2)$. The pre-image of a half-line is

$$Y^{-1}(-\infty, k) = \{Y < k\} = \{w^T X + \theta < \tan k\} = X^{-1}(\mathcal{H}^-_{w,\theta - \tan k}) = X^{-1}(\mathcal{H}^-_{w,u}),$$

with $u \in \mathbb{R}$ a real parameter. We have, like in the previous section,

$$\mathfrak{S}(Y) = X^{-1}\big(\mathfrak{S}\{\mathcal{H}^-_{w,u};\ u \in \mathbb{R}\}\big).$$

11.5.6 Neural network of classical perceptrons 

Consider a neural network with one hidden layer that has two units, each
being a perceptron with a step activation function, see Fig. 11.8. There are
two sets of weights from the input to the hidden layer, $w_1, w_2 \in \mathbb{R}^n$, two
thresholds, $\theta_1, \theta_2 \in \mathbb{R}$, and two weights from the hidden layer to the output
layer, $\alpha_1, \alpha_2 \in \mathbb{R}$. We are interested in describing the information represented
by this network.






Figure 11.9: Information provided by a network of two classical perceptrons.


We shall write the outcome as a linear combination of two indicator functions:

$$\begin{aligned} Y(\omega) &= \alpha_1 H(w_1^T X(\omega) + \theta_1) + \alpha_2 H(w_2^T X(\omega) + \theta_2) \\ &= \alpha_1 1_{\{w_1^T X(\omega) + \theta_1 \ge 0\}} + \alpha_2 1_{\{w_2^T X(\omega) + \theta_2 \ge 0\}} \\ &= \alpha_1 1_{X^{-1}(\mathcal{H}^+_{w_1,\theta_1})}(\omega) + \alpha_2 1_{X^{-1}(\mathcal{H}^+_{w_2,\theta_2})}(\omega) \\ &= \alpha_1 1_{A_1}(\omega) + \alpha_2 1_{A_2}(\omega), \end{aligned}$$

where $A_i = X^{-1}(\mathcal{H}^+_{w_i,\theta_i})$ is the pre-image of an upper half-space through $X$.
Assume $\alpha_1 \ne \alpha_2$. Since we are in the conditions of Example 11.5.3, we have

$$\mathfrak{S}(Y) = \mathfrak{S}(1_{A_1}, 1_{A_2}) = X^{-1}\big(\mathfrak{S}(1_{\mathcal{H}^+_{w_1,\theta_1}}, 1_{\mathcal{H}^+_{w_2,\theta_2}})\big).$$

The sigma-algebra $\mathfrak{S}(1_{\mathcal{H}^+_{w_1,\theta_1}}, 1_{\mathcal{H}^+_{w_2,\theta_2}})$ contains the upper and lower half-spaces,
$\mathcal{H}^{\pm}_{w_i,\theta_i}$, $i = 1, 2$, and all their combinations obtained by intersection
and union, see Fig. 11.9. In particular, since it divides the space into
at most four disjoint regions, the network can classify at most four clusters
of points. This information is definitely richer than the one obtained in the
case of a single classical perceptron.

When the number of units $N$ in the hidden layer increases, the information
structure gets more complex; this can be used to classify a given number
of clusters. For a space of arbitrary dimension $n$ this number might not be
straightforward to obtain, but for $n = 2$, the maximal number of clusters that
can be classified, i.e., the maximum number of regions obtained, is given by
the formula $N(N + 1)/2 + 1$, as can be shown by induction over $N$.
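The region count can be checked numerically for small $N$ by sampling points of the plane and recording which side of each hyperplane they fall on; each distinct sign pattern corresponds to one region. The sketch below is an illustration under assumptions: the lines are hypothetical and chosen in general position, and the sampling box and sample size are arbitrary.

```python
import numpy as np

def max_regions(N):
    """Maximal number of regions cut out by N lines in the plane."""
    return N * (N + 1) // 2 + 1

def count_regions(W, theta, points):
    """Number of distinct activation patterns of the hidden step units."""
    signs = (points @ W.T + theta >= 0).astype(np.int8)
    return np.unique(signs, axis=0).shape[0]

rng = np.random.default_rng(0)
points = rng.uniform(-3.0, 3.0, size=(20000, 2))

# Two hypothetical perceptrons: the lines x = 0 and y = 0.
W2 = np.array([[1.0, 0.0], [0.0, 1.0]])
theta2 = np.array([0.0, 0.0])

# Three hypothetical perceptrons: the lines x = 0, y = 0, and x + y = 1.
W3 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta3 = np.array([0.0, 0.0, -1.0])

print(count_regions(W2, theta2, points), max_regions(2))
print(count_regions(W3, theta3, points), max_regions(3))
```

Sampling can only undercount, since a small region might receive no sample point, so the observed count is a lower estimate bounded above by the formula.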

Before attempting to find the information produced by a network of sigmoid
neurons, we first need the following preparation.




Figure 11.10: a. Triangle as a union of rectangles. b. Half-space as a union 
of infinite rectangles. 


11.5.7 Triangle as a union of rectangles 

Consider a triangle in the plane given by $T = \{x \ge 0,\ y \ge 0,\ x + y \le k\}$, see
Fig. 11.10 a. For any $c \in (0, k)$ the rectangle $R_c = [0, c] \times [0, k - c]$ is inscribed
in the triangle. When the parameter $c$ varies from 0 to $k$, the rectangle union
covers the entire triangle, $T = \bigcup_{0 < c < k} R_c$. By density reasons we may assume
that $c$ takes only rational values, $c \in \mathbb{Q}$, case in which the union becomes
countable

$$T = \bigcup_{\substack{0 < c < k \\ c \in \mathbb{Q}}} R_c.$$

The result still holds even if the conditions $x \ge 0$, $y \ge 0$ are dropped, case in
which the half-plane can be written as a countable union of infinite rectangles,
see Fig. 11.10 b:

$$\{x + y < k\} = \bigcup_{\substack{0 < c < k \\ c \in \mathbb{Q}}} \big(\{x < c\} \cap \{y < k - c\}\big).$$


11.5.8 Network of sigmoid neurons 

Consider a neural network with one hidden layer that has two units, each
being a sigmoid neuron with an activation function $\sigma$, see Fig. 11.11. The
weights from the input to the hidden layer are denoted by $w_1, w_2 \in \mathbb{R}^n$, and
the thresholds by $\theta_1, \theta_2 \in \mathbb{R}$. The two weights from the hidden layer to the
output layer are $\alpha_1, \alpha_2 \in \mathbb{R}$. The output variable $Y$ can be written in terms
of the input random variable $X$ as

$$Y = \alpha_1 \sigma(w_1^T X + \theta_1) + \alpha_2 \sigma(w_2^T X + \theta_2).$$








































































































Figure 11.11: Network of two sigmoid neurons. 


We are interested in finding the information generated by this network, which
is the sigma-algebra $\mathfrak{S}(Y)$.

To get a glimpse of the complexity raised by the nonlinearity of $\sigma$, we
shall produce an explicit computation for the case of an arctangent neuron
with $\alpha_1 = \alpha_2 = 1$. The output in this case is given by

$$Y = \tan^{-1}(w_1^T X + \theta_1) + \tan^{-1}(w_2^T X + \theta_2) = u_1 + u_2.$$

Using a trigonometric formula, we write

$$\tan Y = \tan(u_1 + u_2) = \frac{\tan u_1 + \tan u_2}{1 - \tan u_1 \tan u_2} = \frac{(w_1^T X + \theta_1) + (w_2^T X + \theta_2)}{1 - (w_1^T X + \theta_1)(w_2^T X + \theta_2)}.$$

The output

$$Y = \tan^{-1}\left(\frac{(w_1 + w_2)^T X + \theta_1 + \theta_2}{1 - (w_1^T X + \theta_1)(w_2^T X + \theta_2)}\right)$$

is a lot more complex than the output of a single arctangent neuron, which
is $\tan^{-1}(w^T X + \theta)$.
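A quick numerical check of the addition formula is given below. It treats the two pre-activations $w_i^T X + \theta_i$ as plain numbers `s1` and `s2` (hypothetical values). The identity for $\tan Y$ holds whenever the denominator is nonzero, while the closed form with a single arctangent returns the principal value, so it agrees with $Y$ only when $u_1 + u_2$ lies in $(-\pi/2, \pi/2)$, that is, when `s1 * s2 < 1`; otherwise the two differ by $\pi$.

```python
import math

def two_neuron_output(s1, s2):
    """Y = arctan(s1) + arctan(s2), the sum of two arctangent neurons."""
    return math.atan(s1) + math.atan(s2)

def closed_form(s1, s2):
    """Principal value arctan((s1 + s2) / (1 - s1 * s2))."""
    return math.atan((s1 + s2) / (1.0 - s1 * s2))

# Hypothetical pre-activations with s1 * s2 < 1: both expressions agree.
print(two_neuron_output(0.3, 0.5), closed_form(0.3, 0.5))

# Hypothetical pre-activations with s1 * s2 > 1: tangents agree, values differ by pi.
y, c = two_neuron_output(2.0, 3.0), closed_form(2.0, 3.0)
print(math.tan(y), math.tan(c), y - c)
```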

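In the following we shall attempt to describe the sigma-field $\mathfrak{S}(Y)$. First
we shall find a system of generators. Fix $\alpha_1, \alpha_2 > 0$. Using the decomposition of section 11.5.7,
we can write for any real $k$

$$\begin{aligned} \{Y < k\} &= \{\alpha_1 \sigma(w_1^T X + \theta_1) + \alpha_2 \sigma(w_2^T X + \theta_2) < k\} \\ &= \bigcup_{\substack{c < k \\ c \in \mathbb{Q}}} \Big(\{\alpha_1 \sigma(w_1^T X + \theta_1) < c\} \cap \{\alpha_2 \sigma(w_2^T X + \theta_2) < k - c\}\Big) \\ &= \bigcup_{\substack{c < k \\ c \in \mathbb{Q}}} \Big(\{\sigma(w_1^T X + \theta_1) < c/\alpha_1\} \cap \{\sigma(w_2^T X + \theta_2) < (k - c)/\alpha_2\}\Big) \\ &= \bigcup_{\substack{c < k \\ c \in \mathbb{Q}}} \Big(\Big\{w_1^T X + \theta_1 < \sigma^{-1}\Big(\frac{c}{\alpha_1}\Big)\Big\} \cap \Big\{w_2^T X + \theta_2 < \sigma^{-1}\Big(\frac{k - c}{\alpha_2}\Big)\Big\}\Big) \\ &= \bigcup_{\substack{c < k \\ c \in \mathbb{Q}}} \Big\{X \in \mathcal{H}^-_{w_1,\, \theta_1 - \sigma^{-1}(c/\alpha_1)} \cap \mathcal{H}^-_{w_2,\, \theta_2 - \sigma^{-1}((k-c)/\alpha_2)}\Big\} \\ &= \bigcup_{\substack{c < k \\ c \in \mathbb{Q}}} X^{-1}\big(\mathcal{H}^-_{w_1, u_1(c)}\big) \cap X^{-1}\big(\mathcal{H}^-_{w_2, u_2(c)}\big) \\ &= X^{-1}\Big(\bigcup_{\substack{c < k \\ c \in \mathbb{Q}}} \mathcal{H}^-_{w_1, u_1(c)} \cap \mathcal{H}^-_{w_2, u_2(c)}\Big), \end{aligned}$$

with $u_1(c) = \theta_1 - \sigma^{-1}\big(\frac{c}{\alpha_1}\big)$ and $u_2(c) = \theta_2 - \sigma^{-1}\big(\frac{k - c}{\alpha_2}\big)$. In the last identity we
used Proposition A.0.1. Note that the intersection of half-spaces $\mathcal{H}^-_{w_1, u_1} \cap \mathcal{H}^-_{w_2, u_2}$
contains different sorts of strips and angle sectors. The information
generated by the output $Y$ is the sigma-algebra generated by the previous sets

$$\mathfrak{S}(Y) = \mathfrak{S}\{Y < k;\ k \in \mathbb{R}\} = X^{-1}\Big(\mathfrak{S}\Big\{\bigcup_{\substack{c < k \\ c \in \mathbb{Q}}} \mathcal{H}^-_{w_1, u_1(c)} \cap \mathcal{H}^-_{w_2, u_2(c)};\ k \in \mathbb{R}\Big\}\Big).$$

This sigma-field contains a lot more information than all the other output
information fields encountered so far. This behavior is due to a linear
combination of nonlinear sigmoids. Furthermore, the information complexity
depends on the weights $w_i$, $\alpha_i$ and thresholds $\theta_i$, and therefore a maximum
capacity learning algorithm should be able to tune these parameters such
that the output information $\mathfrak{S}(Y)$ becomes maximal.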

11.5.9 More remarks 

A neural network with multiple hidden layers iteratively combines nonlinear
combinations of sigmoids. The resulting output is even more nonlinear,
which leads to a richer output information $\mathfrak{S}(Y)$. However, there is
an upper bound to all these improvements, since all these sigma-fields are
contained in the input information, $\mathfrak{S}(Y) \subset \mathfrak{S}(X) = \mathcal{I}$. The role of a neural
network is to adapt the input information $\mathcal{I}$ into an output information $\mathcal{E}$,
hoping that eventually the target $Z$ becomes $\mathcal{E}$-measurable, or, equivalently,
$Z$ is determined by the information $\mathcal{E}$. Since this reduces to $\mathcal{Z} \subset \mathcal{E}$, the network
weights have to be tuned to adjust $\mathcal{E}$ to the right maximal size. However,
since sometimes $\mathcal{Z}$ is not a subset of $\mathcal{I}$, the best we can hope for is just the
inclusion $\mathcal{Z} \cap \mathcal{I} \subset \mathcal{E}$.

The massage of the input information $\mathcal{I}$ into the output information $\mathcal{E}$
is done by composing the input $X$ with a measurable function $f$ to obtain the
output $Y = f(X)$. The function $f$ is highly nonlinear and depends on the
network weights and biases.

If the neural network is regarded as a human mind and the sigma-algebra
$\mathcal{I}$ is the information collected by the senses, the measurable function $f$ plays the
role of a thought in the brain, which structures the information into some
understandable information $\mathcal{E}$.

Two minds supplied with the same input information might understand
nonidentical things due to distinct thought functions $f$, which reduces to
having different synaptic weights. There is some information lost, $\mathcal{L}$, which is
the difference between what we can see, $\mathcal{I}$, and what we can understand,
$\mathcal{E}$. The learning process comes with producing more sophisticated thoughts,
$f$, which achieve an expanded understanding, $\mathcal{E}$.

A neural network can also be compared with an engine, which is supplied
some input energy and which produces some output work. At the output the
engine always produces less energy than it is supplied with. A mechanic’s job
is to adjust the engine parameters to make it more efficient by increasing its
capacity; in a similar way the weights are tuned in a neural network to match
a certain target information.

11.6 Compressible Layers 

In this section we discuss how the flow of information propagates through the 
layers of a deep feedforward neural network. In general, each internal layer 
does a certain classification job, by dividing the incoming information into 
pieces and reorganizing them into new information. The next layers perform 
a similar job, condensing the information into larger chunks. 

Since there are presumably some events in the input data field, which
are not relevant to classification, they can be dropped without consequence,
while dropping the others has severe consequences. The selective process of
removing the useless events and keeping only the necessary useful minimum
cannot usually be done in only one layer. This process requires several layers,
each layer dropping selectively certain events, which generate the lost
information at that specific layer.

We consider now a deep feedforward neural network with $L - 1$ hidden
layers and $d^{(\ell)}$ neurons in the $\ell$th layer. Recall that the network input is
an $n$-dimensional random variable, $X = X^{(0)} = (x_1^{(0)}, \dots, x_n^{(0)})^T$, and the
activation of the $\ell$th layer is a $d^{(\ell)}$-dimensional random variable, denoted by
$X^{(\ell)} = (x_1^{(\ell)}, \dots, x_{d^{(\ell)}}^{(\ell)})^T$. The neural network's output is given by the random
vector $Y = X^{(L)} = (x_1^{(L)}, \dots, x_{d^{(L)}}^{(L)})^T$.


The information field generated by the activation of the $\ell$th layer is given
by the sigma-algebra $\mathcal{I}^{(\ell)} = \mathfrak{S}(X^{(\ell)})$. In particular, the input information is
$\mathcal{I}^{(0)} = \mathcal{I}$, and the output information of the network is $\mathcal{I}^{(L)} = \mathcal{E}$. We have seen
that the inclusion $\mathcal{E} \subset \mathcal{I}$ always holds, and $\mathcal{L} = \mathfrak{S}(\mathcal{I} \setminus \mathcal{E})$ is the information lost
in the network. We shall apply a similar approach to each layer of the network.

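The first hidden layer is fed the information $\mathcal{I}^{(0)}$ and it produces the
information $\mathcal{I}^{(1)}$, which is the sigma-field generated by the activation $X^{(1)}$. Since
$X^{(1)}$ depends on the input $X^{(0)}$, then $\mathfrak{S}(X^{(1)}) \subset \mathfrak{S}(X^{(0)})$, i.e., $\mathcal{I}^{(1)} \subset \mathcal{I}^{(0)}$.
The sigma-algebra $\mathcal{L}^{(1)} = \mathfrak{S}(\mathcal{I}^{(0)} \setminus \mathcal{I}^{(1)})$ is the lost information through the first
hidden layer.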

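We shall apply the same construction for each layer. The information that
goes into the $\ell$th layer is the information generated by the activation of the
previous layer, $X^{(\ell-1)}$, and is given by $\mathcal{I}^{(\ell-1)}$. The information that comes
out of the $\ell$th layer is generated by the activation $X^{(\ell)}$ and is given by $\mathcal{I}^{(\ell)}$.
Since $X^{(\ell)}$ depends on the input $X^{(\ell-1)}$ through the forward propagation relation
(6.2.29)

$$X^{(\ell)} = \phi\big(W^{(\ell)T} X^{(\ell-1)} - B^{(\ell)}\big),$$

then $\mathfrak{S}(X^{(\ell)}) \subset \mathfrak{S}(X^{(\ell-1)})$, or $\mathcal{I}^{(\ell)} \subset \mathcal{I}^{(\ell-1)}$. Hence, we obtain a descending
sequence of sigma-algebras:

$$\mathcal{I}^{(L)} \subset \cdots \subset \mathcal{I}^{(\ell)} \subset \mathcal{I}^{(\ell-1)} \subset \cdots \subset \mathcal{I}^{(0)}. \tag{11.6.3}$$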
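On a finite sample space, the sigma-algebra generated by a random variable is determined by the partition of the sample space into its level sets, so the inclusion of sigma-algebras in (11.6.3) says that each layer's partition is a coarsening of the previous one. The following Python sketch illustrates this on a toy example; the sample space, the two layer maps, and all function names are assumptions made purely for illustration.

```python
def level_set_partition(omega, activation):
    """Partition omega into level sets of `activation` (the atoms of the generated sigma-algebra)."""
    blocks = {}
    for w in omega:
        blocks.setdefault(activation(w), set()).add(w)
    return [frozenset(b) for b in blocks.values()]


def is_coarser(coarse, fine):
    """True if every block of `fine` lies inside some block of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)


# Hypothetical toy "network" on a finite sample space.
omega = range(8)


def x0(w):
    return w                          # input X^(0): distinguishes every outcome


def x1(w):
    return x0(w) // 2                 # layer 1: merges outcomes in pairs


def x2(w):
    return 1 if x1(w) >= 2 else 0     # layer 2: a step function of layer 1


p0 = level_set_partition(omega, x0)
p1 = level_set_partition(omega, x1)
p2 = level_set_partition(omega, x2)

print(len(p0), len(p1), len(p2))                   # number of atoms per layer
print(is_coarser(p1, p0), is_coarser(p2, p1))      # descending chain of information
```

Each merge of atoms corresponds to events that can no longer be distinguished after the layer, which is the finite counterpart of the lost information.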


The difference $\mathcal{I}^{(\ell-1)} \setminus \mathcal{I}^{(\ell)}$ contains the events that have been dropped by the
$\ell$th layer. This difference is not a sigma-algebra, but it generates one. The
size of this sigma-algebra describes the compressibility of the $\ell$th layer.

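Definition 11.6.1 (a) The lost information in the $\ell$th layer is given by the
sigma-algebra

$$\mathcal{L}^{(\ell)} = \mathfrak{S}\big(\mathcal{I}^{(\ell-1)} \setminus \mathcal{I}^{(\ell)}\big).$$

(b) The $\ell$th layer is called uncompressed if the lost information is trivial,
$\mathcal{L}^{(\ell)} = \{\emptyset, \Omega\}$.

(c) The $\ell$th layer is called totally compressed if the lost information is
maximal, i.e., $\mathcal{L}^{(\ell)} = \mathcal{I}^{(\ell-1)}$.

(d) The $\ell$th layer is called partially compressed if $\{\emptyset, \Omega\} \neq \mathcal{L}^{(\ell)} \neq \mathcal{I}^{(\ell-1)}$.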


Remark 11.6.2 The definition implies: 




(i) If the $\ell$th layer is uncompressed, then $\mathcal{I}^{(\ell-1)} = \mathcal{I}^{(\ell)}$, namely, the
information field remains unchanged.

(ii) If the $\ell$th layer is totally compressed, then $\mathcal{I}^{(\ell)} = \{\emptyset, \Omega\}$. The inclusion
sequence (11.6.3) implies $\mathcal{I}^{(p)} = \{\emptyset, \Omega\}$, for all $\ell < p \le L$. This makes the
rest of the layers, from $\ell + 1$ to $L$, useless.


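We shall establish in the following the relation between the information fields
$\mathcal{I}$, $\mathcal{E}$, and $\mathcal{L}^{(\ell)}$. Proposition 11.3.3 provides the decomposition

$$\mathcal{I}^{(\ell-1)} = \mathcal{L}^{(\ell)} \vee \mathcal{I}^{(\ell)}.$$

Iterating with $\ell = 1, 2, \dots, L$, we have

$$\begin{aligned}
\mathcal{I}^{(0)} &= \mathcal{L}^{(1)} \vee \mathcal{I}^{(1)} \\
\mathcal{I}^{(1)} &= \mathcal{L}^{(2)} \vee \mathcal{I}^{(2)} \\
&\;\;\vdots \\
\mathcal{I}^{(L-1)} &= \mathcal{L}^{(L)} \vee \mathcal{I}^{(L)}.
\end{aligned}$$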

By backward substitution, we obtain a representation of the input
information field in terms of all lost information in the layers and the output
information as

$$\mathcal{I} = \bigvee_{\ell=1}^{L} \mathcal{L}^{(\ell)} \vee \mathcal{E}. \tag{11.6.4}$$

We note that we have applied the associativity property of the operator $\vee$,
see Exercise 11.10.10. Comparing with $\mathcal{I} = \mathcal{L} \vee \mathcal{E}$, and using the maximality
of $\mathcal{L}$, see Proposition 11.3.3 part (b), yields

$$\bigvee_{\ell=1}^{L} \mathcal{L}^{(\ell)} \subset \mathcal{L}. \tag{11.6.5}$$

This means that the total information lost in a network, $\mathcal{L}$, is at least as large
as the cumulation of information lost in each of the network layers.

The next result is a consequence of the previous relation.

Proposition 11.6.3 If the lost information in a neural network is trivial,
then all of its layers are uncompressed.

Proof: Since $\mathcal{L} = \{\emptyset, \Omega\}$, formula (11.6.5) implies that $\bigvee_{\ell=1}^{L} \mathcal{L}^{(\ell)} = \{\emptyset, \Omega\}$,
and hence for each $\ell$ we have $\mathcal{L}^{(\ell)} \subset \{\emptyset, \Omega\}$, which actually implies equality,
$\mathcal{L}^{(\ell)} = \{\emptyset, \Omega\}$. ■






11.7 Layers Compressibility 

We shall present sufficient conditions for a layer to be uncompressed. We
shall start with some background preparation.

We assume the $\ell$th layer is uncompressed, i.e., $\mathcal{L}^{(\ell)} = \{\emptyset, \Omega\}$. This means
the incoming and outgoing information are equal, $\mathcal{I}^{(\ell-1)} = \mathcal{I}^{(\ell)}$, or in terms
of sigma-algebras, $\mathfrak{S}(X^{(\ell-1)}) = \mathfrak{S}(X^{(\ell)})$. This identity occurs when $X^{(\ell)}$ and
$X^{(\ell-1)}$ can be expressed in terms of each other, i.e., there are two measurable
functions $F$ and $G$ such that $X^{(\ell)} = F(X^{(\ell-1)})$ and $X^{(\ell-1)} = G(X^{(\ell)})$, see
Proposition D.5.1 in the Appendix. We shall investigate the existence of
functions $F$ and $G$.

The existence of $F$ follows from the forward pass formula (6.2.29)

$$X^{(\ell)} = \phi\big(W^{(\ell)T} X^{(\ell-1)} - B^{(\ell)}\big). \tag{11.7.6}$$

Therefore, it suffices to state sufficient conditions for inverting this relation.
This is done by the next result.

Theorem 11.7.1 (compressibility conditions I) Assume the following
conditions are satisfied:

(i) The activation function $\phi$ is invertible;

(ii) There is the same number of incoming and outgoing variables for the $\ell$th
layer, i.e., $d^{(\ell-1)} = d^{(\ell)}$;

(iii) $\det W^{(\ell)} \neq 0$.

Then the $\ell$th layer is uncompressed.

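Proof: Since $\phi$ is invertible, relation (11.7.6) becomes

$$W^{(\ell)T} X^{(\ell-1)} = \phi^{-1}(X^{(\ell)}) + B^{(\ell)}.$$

Since $d^{(\ell-1)} = d^{(\ell)}$, then $W^{(\ell)}$ is a square matrix of nonzero determinant, so it
is invertible. Therefore

$$X^{(\ell-1)} = \big(W^{(\ell)T}\big)^{-1}\big(\phi^{-1}(X^{(\ell)}) + B^{(\ell)}\big),$$

or $X^{(\ell-1)} = G(X^{(\ell)})$, with $G$ measurable. Hence $\mathcal{I}^{(\ell-1)} \subset \mathcal{I}^{(\ell)}$. Since the
converse inclusion is trivial, it turns out that $\mathcal{I}^{(\ell-1)} = \mathcal{I}^{(\ell)}$, i.e., the $\ell$th layer
is uncompressed. ■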
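The inversion used in the proof can be checked numerically. The sketch below assumes a logistic activation and a hand-picked invertible $3 \times 3$ weight matrix; the specific numbers and the helper names `logistic` and `logit` are illustrative assumptions, not values from the text.

```python
import numpy as np


def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))


def logit(x):
    """Inverse of the logistic function on (0, 1)."""
    return np.log(x / (1.0 - x))


# Assumed square weight matrix W^(l) with det W = 3 (condition (iii)) and bias B^(l).
W = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
B = np.array([0.1, -0.2, 0.3])

x_prev = np.array([0.5, -1.0, 0.25])       # one sample of X^(l-1)
x_curr = logistic(W.T @ x_prev - B)        # forward pass (11.7.6)

# Inverse map G: solve W^T x = phi^{-1}(x_curr) + B for x.
x_rec = np.linalg.solve(W.T, logit(x_curr) + B)
print(np.allclose(x_rec, x_prev))
```

Recovering the previous activation from the current one is exactly the existence of the function $G$, so no information is lost through this layer.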

Remark 11.7.2 Condition (i) is satisfied by all strictly increasing activation
functions, such as the logistic function, hyperbolic tangent, arctangent, softplus,
etc. However, the invertibility condition is not satisfied by ReLU, unit step, or
bipolar activation functions. Also, pooling functions are not invertible, since
they are not one-to-one; therefore pooling provides compression, and hence,
loss of information.

Figure 11.12: A two-layer neural network.

Example 11.7.1 (Two-layer network) Consider a neural network with
no hidden layers, the only layers being the input and the output layers,
see Fig. 11.12. Assume the dimension of the input equals the dimension
of the output, i.e., $d^{(0)} = d^{(1)} = n$. Denote the weight matrix by
$W^{(1)} = (w_{ij}^{(1)})_{1 \le i,j \le n}$ and the bias vector by $B^{(1)} = (b_1^{(1)}, \dots, b_n^{(1)})^T$. We also assume
that $\phi$ is increasing. The output is given by

$$X^{(1)} = \phi\big(W^{(1)T} X^{(0)} - B^{(1)}\big).$$

The network is uncompressed as long as $\det W^{(1)} \neq 0$. This roughly states
that the outputs are functional independent of the inputs. The next section deals
with this topic in more detail.


11.8 Functional Independence 

As usual, $X^{(\ell-1)} = (x_1^{(\ell-1)}, \dots, x_n^{(\ell-1)})$ and $X^{(\ell)} = (x_1^{(\ell)}, \dots, x_n^{(\ell)})$ denote the
activations of layers $\ell - 1$ and $\ell$, where we make the assumption that these
layers have equal dimensions, $d^{(\ell-1)} = d^{(\ell)} = n$. Consider the determinant of
the Jacobian

$$\Delta^{(\ell)} = \det\left(\frac{\partial x_j^{(\ell)}}{\partial x_i^{(\ell-1)}}\right)_{i,j}.$$


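The activations $X^{(\ell)}$ and $X^{(\ell-1)}$ are called functional independent if $\Delta^{(\ell)} \neq 0$.
Using the formula $s_j^{(\ell)} = \sum_{i=1}^{n} w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)}$, the elements of the Jacobian
can be computed as

$$\frac{\partial x_j^{(\ell)}}{\partial x_i^{(\ell-1)}}
= \frac{\partial \phi(s_j^{(\ell)})}{\partial x_i^{(\ell-1)}}
= \phi'(s_j^{(\ell)}) \frac{\partial s_j^{(\ell)}}{\partial x_i^{(\ell-1)}}
= \phi'(s_j^{(\ell)})\, w_{ij}^{(\ell)}.$$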


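Applying the determinant yields

$$\Delta^{(\ell)} = \det\big(\phi'(s_j^{(\ell)})\, w_{ij}^{(\ell)}\big)_{i,j} = \phi'(s_1^{(\ell)}) \cdots \phi'(s_n^{(\ell)}) \det W^{(\ell)}, \tag{11.8.7}$$

where we used the multilinearity property of determinants. If the activation
function, $\phi$, satisfies $\phi' \neq 0$, then $\Delta^{(\ell)} = 0$ if and only if $\det W^{(\ell)} = 0$.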

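Remark 11.8.1 The condition $\phi' \neq 0$ is satisfied by most sigmoid-type
activation functions. For instance, if $\phi = \sigma$ is the logistic function, then

$$\Delta^{(\ell)} = \prod_{i=1}^{n} \sigma(s_i^{(\ell)})\big(1 - \sigma(s_i^{(\ell)})\big) \det W^{(\ell)}.$$

If $\phi = t$ is the hyperbolic tangent, then

$$\Delta^{(\ell)} = \prod_{i=1}^{n} \big(1 - t^2(s_i^{(\ell)})\big) \det W^{(\ell)}.$$

We notice that in these cases each factor of the product lies in $(0, 1]$, so we
have the inequality

$$|\Delta^{(\ell)}| \le |\det W^{(\ell)}|.$$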
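Formula (11.8.7) can be compared against a finite-difference Jacobian. The following sketch assumes a logistic layer with illustrative weights; `numerical_jacobian` is a hypothetical helper written for this check, not part of any library.

```python
import numpy as np


def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))


def layer(x, W, B):
    """Forward pass of one layer with logistic activation."""
    return logistic(W.T @ x - B)


def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian with entries J[j, i] = d f_j / d x_i."""
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2 * h)
    return J


# Illustrative (assumed) weights, bias and evaluation point.
W = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
B = np.array([0.1, -0.2, 0.3])
x = np.array([0.5, -1.0, 0.25])

s = W.T @ x - B
sigma = logistic(s)
delta_formula = np.prod(sigma * (1.0 - sigma)) * np.linalg.det(W)   # (11.8.7)
delta_numeric = np.linalg.det(numerical_jacobian(lambda v: layer(v, W, B), x))
print(delta_formula, delta_numeric)
```

The two values should agree up to finite-difference error, and the magnitude of the result does not exceed $|\det W^{(\ell)}|$, in line with the inequality in the remark.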


The next result is a reformulation of Theorem 11.7.1:

Theorem 11.8.2 (compressibility conditions II) Assume the following
conditions hold:

(i') The activation function satisfies $\phi' > 0$;

(ii') There is the same number of incoming and outgoing variables for the $\ell$th
layer, i.e., $d^{(\ell-1)} = d^{(\ell)}$;

(iii') $X^{(\ell)}$ is functional independent of $X^{(\ell-1)}$.

Then the $\ell$th layer is uncompressed.












Proof: Condition (i') implies that $\phi$ is both increasing and has nonvanishing
derivative, $\phi' \neq 0$. From condition (iii') we infer $\Delta^{(\ell)} \neq 0$, and since $\phi' \neq 0$, it
follows that $\det W^{(\ell)} \neq 0$. Applying Theorem 11.7.1 yields the desired result.

■

A variant of the previous result covering the case of a decoder layer, i.e.,
the case when layers increase in size, is given next:

Theorem 11.8.3 (compressibility conditions III) Assume the following
conditions are satisfied:

(i) The activation function $\phi$ is invertible;

(ii) The number of outgoing variables in the $\ell$th layer is greater than the
number of incoming variables, i.e., $d^{(\ell)} > d^{(\ell-1)}$;

(iii) $W^{(\ell)}$ has maximal rank, i.e., $\operatorname{rank} W^{(\ell)} = d^{(\ell-1)}$.

Then the lost information in the $\ell$th layer is trivial, i.e., the $\ell$th layer is
uncompressed.

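Proof: Using the invertibility of the activation function $\phi$, relation (11.7.6)
becomes

$$W^{(\ell)T} X^{(\ell-1)} = \phi^{-1}(X^{(\ell)}) + B^{(\ell)}.$$

This can be considered as a linear system with $d^{(\ell)}$ equations and $d^{(\ell-1)}$
unknowns. Since $\operatorname{rank} W^{(\ell)T} = \operatorname{rank} W^{(\ell)} = d^{(\ell-1)}$, by an eventual reindexing,
we can solve for $X^{(\ell-1)} = (x_1^{(\ell-1)}, \dots, x_{d^{(\ell-1)}}^{(\ell-1)})$ in terms of $x_1^{(\ell)}, \dots, x_{d^{(\ell-1)}}^{(\ell)}$. This
implies

$$\mathcal{I}^{(\ell-1)} = \mathfrak{S}(X^{(\ell-1)}) \subset \mathfrak{S}\big(x_1^{(\ell)}, \dots, x_{d^{(\ell-1)}}^{(\ell)}\big) \subset \mathfrak{S}(X^{(\ell)}) = \mathcal{I}^{(\ell)}, \tag{11.8.8}$$

where the first and last of the previous identities result from the definition
and the last inclusion is obvious. By transitivity it follows that $\mathcal{I}^{(\ell-1)} \subset \mathcal{I}^{(\ell)}$.
Since for feedforward neural networks we always have $\mathcal{I}^{(\ell)} \subset \mathcal{I}^{(\ell-1)}$, the
double inclusion implies the identity $\mathcal{I}^{(\ell-1)} = \mathcal{I}^{(\ell)}$. Therefore the $\ell$th layer is
uncompressed.

■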
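For a decoder layer the linear system in the proof is overdetermined but consistent, so the previous activation can be recovered, for instance, by least squares. The sketch below assumes a logistic activation and a layer going from 2 to 3 neurons; all numerical values are illustrative assumptions.

```python
import numpy as np


def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))


def logit(x):
    """Inverse of the logistic function on (0, 1)."""
    return np.log(x / (1.0 - x))


# Assumed decoder layer: d^(l-1) = 2 incoming and d^(l) = 3 outgoing variables.
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])           # shape (d_prev, d_curr)
B = np.array([0.2, -0.1, 0.0])
assert np.linalg.matrix_rank(W) == W.shape[0]   # condition (iii): maximal rank

x_prev = np.array([0.7, -0.4])             # one sample of X^(l-1)
x_curr = logistic(W.T @ x_prev - B)        # forward pass (11.7.6)

# Solve the consistent overdetermined system W^T x = phi^{-1}(x_curr) + B.
x_rec, *_ = np.linalg.lstsq(W.T, logit(x_curr) + B, rcond=None)
print(np.allclose(x_rec, x_prev))
```

Least squares is only one convenient way to solve the system; the proof itself selects $d^{(\ell-1)}$ independent equations after reindexing.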

We shall discuss next the case of an encoder layer. This means $d^{(\ell-1)} > d^{(\ell)}$,
i.e., the $\ell$th layer has fewer neurons than the previous layer. Then $X^{(\ell)}$
can be written in terms of $X^{(\ell-1)}$ as $X^{(\ell)} = G(X^{(\ell-1)})$, with
$G : \mathbb{R}^{d^{(\ell-1)}} \to \mathbb{R}^{d^{(\ell)}}$ measurable, so $\mathcal{I}^{(\ell)} \subset \mathcal{I}^{(\ell-1)}$. Given the inequality of dimensions, it
follows that $G$ is non-invertible,² the previous inclusion being strict, a fact
that implies a compression in the $\ell$th layer. A neural network satisfying the
sequence of inequalities

$$d^{(0)} > d^{(1)} > \cdots > d^{(L-1)} > d^{(L)}$$

exhibits compression effects in each of its layers.

² For instance, the function $G : \mathbb{R}^3 \to \mathbb{R}^2$, $G(x_1, x_2, x_3) = (x_1, x_2)$, is not invertible.

11.9 Summary 

This chapter formalizes in the language of sigma-algebras the concept of
information in a neural network, having the input and target given by random
variables. We introduced the concepts of input, output, and target information
fields and discussed their properties and relation to learning.

The concept of lost information in a neural network describes the content
of information that cannot be recovered when passing forward through
the network. It can also be defined on individual layers and used to define
compressed and uncompressed layers. We provided results regarding the
compressibility of encoder and decoder layers. An encoder layer compresses the
information, the lost information being nontrivial, while a decoder layer
satisfying the conditions of Theorem 11.8.3 is uncompressed, having a trivial
lost information.

We discussed the information representation for several examples involving
neurons and simple neural networks, emphasizing the geometric interpretation.
These examples include the perceptron, sigmoid neuron, linear neuron, and
arctangent neuron.

11.10 Exercises 

Exercise 11.10.1 Let $c$ be a constant and $X$ be a real-valued random
variable. Show that:

(a) $\mathfrak{S}(c) = \{\emptyset, \Omega\}$;

(b) $\mathfrak{S}(c + X) = \mathfrak{S}(X)$;

(c) $\mathfrak{S}(cX) = \mathfrak{S}(X)$ for $c \neq 0$.

Exercise 11.10.2 Consider a sequence of random variables $(X_n)_n$ and define
two more sequences, $(Y_n)_n$ and $(Z_n)_n$, by the following rules: $Y_0 = 0$,
$Y_n = \sum_{i=1}^{n} X_i$ and $Z_0 = 0$, $Z_n = \sum_{i=1}^{n} Y_i$.

(a) Show the strict inclusion $\mathfrak{S}(Y_n) \subset \mathfrak{S}(X_1, \dots, X_n)$;

(b) Show the identity of sigma-fields $\mathfrak{S}(Y_1, \dots, Y_n) = \mathfrak{S}(X_1, \dots, X_n)$;

(c) Prove that $Z_n = nX_1 + (n-1)X_2 + \cdots + X_n$ and $\mathfrak{S}(Z_1, \dots, Z_n) =
\mathfrak{S}(X_1, \dots, X_n)$.





Exercise 11.10.3 (The sigma-field generated by two sets) Show that the
smallest sigma-field that contains two measurable sets, $A$ and $B$, has 16 elements
and it is given by

$$\begin{aligned}
\mathfrak{S}(A, B) = \{ &\emptyset, A, B, A^c, B^c, A \cap B, A \cup B, A^c \cap B^c, A^c \cup B^c, A^c \cap B, A \cap B^c, \\
&A^c \cup B, A \cup B^c, (A \cap B^c) \cup (A^c \cap B), (A \cap B) \cup (A \cup B)^c, \Omega \}.
\end{aligned}$$

Exercise 11.10.4 How many elements does the sigma-field generated by three
sets have?

Exercise 11.10.5 Let $\{E_j\}_{j=1,\dots,N}$, $N \ge 3$, be a finite partition of the sample
space $\Omega$. The sigma-field generated by the partition is given by

$$\mathcal{I} = \Big\{\emptyset,\; E_i,\; E_i \cup E_j,\; E_i \cup E_j \cup E_k,\; \dots,\; \bigcup_{i=1}^{N} E_i = \Omega\Big\}.$$

Denote by $\mathcal{F}_{E_1}$, $\mathcal{F}_{E_1,E_2}$ the sigma-fields generated by $E_1$ and $\{E_1, E_2\}$,
respectively. Show that:

(a) $\mathcal{F}_{E_1,E_2}$ is a recoverable information;

(b) $\mathcal{F}_{E_1}$ is not a maximal recoverable information.

Exercise 11.10.6 Consider the input of a neural network given by the
$n$-dimensional random variable $X = (X_1, \dots, X_n)$ on the measurable space
$(\Omega, \mathcal{F})$. Prove that the input information $\mathfrak{S}(X)$ is the sigma-algebra generated
by events of the form

$$\bigcap_{i=1}^{k} \{\omega;\; X_i(\omega) \le x_i\}$$

with $x_1, \dots, x_k \in \mathbb{R}$, $k \ge 1$.

Exercise 11.10.7 Assume that we drop one neuron from the last layer of
a neural network. This means that we erase all synapses related to that
particular neuron. Let $Y$ be the output of the initial network and $\tilde{Y}$ the
output after the neuron dropout has occurred. Let $\mathcal{E} = \mathfrak{S}(Y)$ and $\tilde{\mathcal{E}} = \mathfrak{S}(\tilde{Y})$
be the output information and $\mathcal{L}$ and $\tilde{\mathcal{L}}$ be the lost information in each case.

(a) Show that $\mathcal{L} \subset \tilde{\mathcal{L}}$.

(b) Formulate a similar result for the case when the neuron is dropped from
the $\ell$th layer.

Exercise 11.10.8 Assume that we add an extra neuron to the last layer of a
neural network. Let $Y$ be the output of the initial network and $\tilde{Y}$ the output
after the neuron addition has occurred. Let $\mathcal{E} = \mathfrak{S}(Y)$ and $\tilde{\mathcal{E}} = \mathfrak{S}(\tilde{Y})$ be the
output information and $\mathcal{L}$ and $\tilde{\mathcal{L}}$ be the lost information in each case.

(a) Show that $\tilde{\mathcal{L}} \subset \mathcal{L}$;

(b) Formulate a similar result for the case when the neuron is added to the
$\ell$th layer.

Figure 11.13: a. The clusters can be separated by a line; b. One cluster is
inside of a triangle and the other is outside; c. One cluster lies inside of a
rectangle.

Figure 11.14: a. For Exercise 11.10.11. b. For Exercise 11.10.12.

Exercise 11.10.9 The clusters given in Fig. 11.13 a, b, and c are to be
classified by a neural network with one hidden layer. What is the minimal
number of neurons that need to be used in the hidden layer for each of the
aforementioned cases?

Exercise 11.10.10 Show that the operator $\vee$ on sigma-fields is associative, i.e.,
for any three sigma-fields, $\mathcal{F}$, $\mathcal{G}$, $\mathcal{H}$, we have

$$(\mathcal{F} \vee \mathcal{G}) \vee \mathcal{H} = \mathcal{F} \vee (\mathcal{G} \vee \mathcal{H}).$$

Exercise 11.10.11 Show that the hidden layer of the neural network given
by Fig. 11.14 a is efficient, i.e., $\mathcal{I}^{(0)} = \mathcal{I}^{(1)}$.









Exercise 11.10.12 Consider the one-hidden layer autoencoder given in 
Fig.11.14 b. Show that X^ — X^\ 

Exercise 11.10.13 Let $\mathcal{I}$, $\mathcal{E}$, and $\mathcal{L}$ denote the input, output, and lost
information in a network. Show that $\mathcal{I} = \mathcal{E}$ if and only if $\mathcal{L} = \{\emptyset, \Omega\}$.

Exercise 11.10.14 Let $\mathcal{I}$ and $\mathcal{E}$ denote the input and the output information
fields of a network. Prove that it is not possible to place $\mathcal{E}$ between a maximal
recovery information field, $\mathcal{F}$, and a minimal non-recovery information field,
$\mathcal{H}$, namely, we cannot have the double inclusion $\mathcal{H} \subset \mathcal{E} \subset \mathcal{F}$.

Exercise 11.10.15 Let $\mathcal{I}$ and $\mathcal{E}$ denote the input and the output information
fields of a network. Show the following:

(i) If $\mathcal{E}$ is contained in a maximal recoverable information field, then $\mathcal{E}$ does
not contain any minimal non-recoverable information fields;

(ii) If we assume that $\mathcal{E}$ contains a minimal non-recoverable information field,
then $\mathcal{E}$ is not contained in a maximal recoverable information field.

Exercise 11.10.16 Show that any subfield of a recoverable information field 
is also recoverable. 




Chapter 12 

Information Capacity 
Assessment 


This chapter deals with one of the main problems of Deep Learning, namely,
how can a neural network go from raw data to a more complex representation
as the data flows through the network layers? The organization of pixels into
features can be assessed by some information measures, such as entropy,
conditional entropy, and mutual information. These measures are used to describe
the information evolution through the layers of a feedforward network.

If the previous chapter provided a qualitative description of the information
content through the layers of a neural network, this chapter deals with
a quantitative description of the evolution of the information in a neural
network.

The reader interested in a more detailed presentation of these topics is
referred to the book [11]. The applications to neural networks include the definition
and computation of network capacity and the information bottleneck method.

The previous chapter described the process of compressing the information
through the layers of a network. In this chapter we shall introduce a measure
of assessment of the compressibility of a layer using mutual information.


12.1 Entropy and Properties 

The entropy describes the information organization of a system. It is
maximum in the case when the system is completely uncertain. For instance, a
picture with pixels represented by white noise will have maximum entropy,
while a picture of a geometrical figure will have a smaller entropy. Raw data
inputs of a neural network have a larger entropy than its output complex
representation.

© Springer Nature Switzerland AG 2020
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3_12

For instance, we shall consider the example of a neural network, which 
is fed data representing pictures of cats and dogs, and has to figure out 
the animal type. The entropy of the output layer is 1 bit, as the information 
needed to know is one out of two possible choices. However, each input picture 
has a larger entropy. The task of the neural network is to gradually decrease 
the entropy until the entropy of the output is just the correct one (namely, 
one bit), and hence the interest in how the entropy decreases along the layers 
of a neural network. 

The next subsections introduce the concept of entropy and present some
of its properties. Then, we shall approach the study of expressionless layers
using the concept of entropy.

12.1.1 Entropy of a random variable 

Let $X$ be a discrete random variable taking values $\{x_1, x_2, \dots, x_n\}$ with
probabilities $p_k = P(X = x_k)$. The negative log-likelihood, $-\ln p_k$, represents the
information that $X$ takes the value $x_k$ and its mathematical formulation is
discussed in section D.7 of the Appendix. Then the entropy of $X$ is defined
by the sum

$$H(X) = -\sum_{k=1}^{n} p_k \ln p_k = \mathbb{E}^{p}[-\ln p], \tag{12.1.1}$$

where $\mathbb{E}^{p}$ represents the expectation operator with respect to the
distribution $p$. Relation (12.1.1) is a weighted average of the negative log-likelihood
function and represents the expected amount of information contained in the
random variable $X$ having the probability distribution $p = (p_1, \dots, p_n)^T$.
Since $p_k \in (0, 1)$, then $H(X) > 0$.
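Definition (12.1.1) translates directly into a few lines of code. The sketch below is a minimal illustration; the function name `entropy`, its optional `base` argument, and the two example distributions are assumptions introduced here.

```python
import math


def entropy(p, base=math.e):
    """Discrete entropy -sum_k p_k log p_k; terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log(pk, base) for pk in p if pk > 0)


uniform = [0.25] * 4                  # four equally likely values
skewed = [0.7, 0.1, 0.1, 0.1]         # same support, less uncertainty

print(entropy(uniform))               # entropy in nats (natural logarithm)
print(entropy(uniform, base=2))       # same entropy expressed in bits
print(entropy(skewed) < entropy(uniform))
```

With the natural logarithm the entropy is measured in nats; changing the base to 2 gives bits, the unit used later for the output of a classifier.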

In the case when the random variable $X$ is continuous, its distribution is
given by the density $p(x)$. This is usually written heuristically as

$$P(x < X < x + dx) = p(x)\,dx,$$

i.e., the probability that $X$ takes values in the infinitesimal interval
$(x, x + dx)$ is proportional to its length, $dx$, and this proportionality function is
the density $p(x)$. In this case the probability measure of $X$ is absolutely
continuous with respect to the Lebesgue measure, and is given by $\mu(dx) =
p(x)\,dx$, see Appendix, section C.7. If the range of values of $X$ is denoted by
$\mathcal{X}$, the associated entropy is defined by

$$H(X) = H(p) = -\int_{\mathcal{X}} p(x) \ln p(x)\,dx = \mathbb{E}^{p}[-\ln p]. \tag{12.1.2}$$




Relation (12.1.1) defines the discrete entropy while (12.1.2) provides the
differential entropy. One major difference between them is that the latter might
be, sometimes, negative, or infinite. Some sufficient conditions for bounds on
entropy are provided in Exercise 12.13.2.
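As a quick check that the differential entropy can be negative, consider, as an illustrative choice not taken from the text, a random variable $X$ uniformly distributed on the interval $(0, a)$, with density $p(x) = 1/a$. Then (12.1.2) gives

$$H(X) = -\int_0^a \frac{1}{a} \ln\frac{1}{a}\,dx = \ln a,$$

which is negative for $a < 1$ (for instance, $H(X) = -\ln 2$ for $a = 1/2$) and decreases without bound as $a \to 0$.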

Similarly, we can define the joint entropy of two continuous random
variables $X$ and $Y$ as

$$H(X, Y) = -\int_{\mathbb{R}} \int_{\mathbb{R}} p(x, y) \ln p(x, y)\,dx\,dy.$$

This can be interpreted as the information contained in the pair of random
variables $(X, Y)$. The corresponding relation for the entropy in the case of
two discrete random variables is

$$H(X, Y) = -\sum_{i,j} p_{ij} \ln p_{ij},$$

where $p_{ij} = P(X = x_i, Y = y_j)$ is the joint distribution of $(X, Y)$.

12.1.2 Entropy under a change of coordinates 

This section shows that the entropy is coordinate dependent. For a better 
understanding, we shall start with an example. 

Exercise 12.1.1 We assume that $X$ is a random variable describing the pixel
activations of a gray-scale picture with dimensions $m \times k$. Let $p_{ij}$ denote the
activation of the pixel with position $(i, j)$, where $1 \le i \le m$, $1 \le j \le k$. We
assume $0 \le p_{ij} \le 1$, with value 0 corresponding to white pixels and value
1 to black pixels. We can also assume the normality condition $\sum_{i,j} p_{ij} = 1$
satisfied, which can be obtained by dividing each $p_{ij}$ by the sum. Thus, $p_{ij}$ can
be regarded as a probability distribution. The entropy of the image is given by

$$H(X) = -\sum_{i,j} p_{ij} \ln p_{ij}.$$

Now, we consider the picture obtained by applying a transformation, $f$, to the
former picture, obtaining $Y = f(X)$. How does the entropy change? If the
transformation is a rotation, a translation, or a flip, the amount of information
in the picture should not change, so $H(Y) = H(X)$. However, if the
transformation is not rigid (but still smooth and invertible), the picture's information
changes. For instance, you can continuously deform a picture of a circle into a
picture of a square, changing the information, and hence, the entropy.
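The discrete pixel entropy of this example can be computed directly. The sketch below uses a random array as a stand-in for a gray-scale picture (an assumption for illustration) and checks invariance under a 90-degree rotation and a horizontal flip, which only rearrange the pixels; the helper name `image_entropy` is hypothetical.

```python
import numpy as np


def image_entropy(img):
    """Entropy of a nonnegative image after normalizing its pixels to sum to 1."""
    p = img / img.sum()
    p = p[p > 0]                          # the term 0 * ln 0 is taken to be 0
    return float(-(p * np.log(p)).sum())


rng = np.random.default_rng(0)
img = rng.random((4, 6))                  # hypothetical picture with values in [0, 1)

h = image_entropy(img)
print(np.isclose(h, image_entropy(np.rot90(img))))          # rotation by 90 degrees
print(np.isclose(h, image_entropy(np.flip(img, axis=1))))   # horizontal flip
print(h <= np.log(img.size))              # ln(mk) is the entropy of a uniform image
```

This finite sum is insensitive to any rearrangement of pixels, so it cannot detect a smooth deformation by itself; the change of entropy under such maps is captured by the continuous formula of the next proposition.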

The next result provides a formula for the change in the entropy of a
continuous random variable under a smooth coordinate transform.




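Proposition 12.1.2 Let $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}^n$ be two domains. Let $X$ and $Y$ be
continuous random variables on $\mathcal{X}$ and $\mathcal{Y}$, respectively, with $Y = f(X)$, where
$f : \mathcal{X} \to \mathcal{Y}$ is a differentiable function having a nonsingular Jacobian,
$\det J_f(x) \neq 0$, with

$$J_f(x) = \left(\frac{\partial f_j}{\partial x_k}\right)_{j,k}.$$

Then

$$H(Y) = H(X) - \mathbb{E}^{p_Y}\big[\ln|\det J_{f^{-1}}(Y)|\big]. \tag{12.1.3}$$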


Proof: Let $p_X$ and $p_Y$ be the densities of the random variables $X$ and $Y$,
respectively. It is known that if $Y = f(X)$ then the relation between their
densities is

$$p_X(x) = p_Y(f(x))\,|\det J_f(x)|.$$

This formula can be found in any book of elementary probabilities, such as
in Wackerly et al. [120], Chapter 6. Using the change of variables $y = f(x)$,
the formula $dy = |\det J_f(x)|\,dx$, and the fact that the transformation $f$ is
invertible (since the Jacobian is nonsingular), we have

$$\begin{aligned}
H(X) &= -\int_{\mathcal{X}} p_X(x) \ln p_X(x)\,dx \\
&= -\int_{\mathcal{X}} p_Y(f(x))\,|\det J_f(x)| \ln\big(p_Y(f(x))\,|\det J_f(x)|\big)\,dx \\
&= -\int_{\mathcal{Y}} p_Y(y) \ln\left(\frac{p_Y(y)}{|\det J_{f^{-1}}(y)|}\right) dy \\
&= -\int_{\mathcal{Y}} p_Y(y) \ln p_Y(y)\,dy + \int_{\mathcal{Y}} p_Y(y) \ln|\det J_{f^{-1}}(y)|\,dy \\
&= H(Y) + \mathbb{E}^{p_Y}\big[\ln|\det J_{f^{-1}}(Y)|\big],
\end{aligned}$$

which ends the proof.
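Relation (12.1.3) can be checked in closed form for a Gaussian input and a linear map, where both entropies are known explicitly. The sketch below assumes $X$ is a centered bivariate normal and $Y = MX$; the covariance, the matrix $M$, and the helper `gaussian_entropy` are illustrative assumptions.

```python
import numpy as np


def gaussian_entropy(cov):
    """Differential entropy (in nats) of a multivariate normal with covariance `cov`."""
    n = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))


cov_x = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # assumed covariance of X
M = np.array([[0.5, 0.2],
              [0.0, 0.4]])                # Y = M X, so J_f = M and J_{f^{-1}} = M^{-1}

h_x = gaussian_entropy(cov_x)
h_y = gaussian_entropy(M @ cov_x @ M.T)   # Y is normal with covariance M cov_x M^T

# Right-hand side of (12.1.3); the Jacobian is constant, so the expectation is trivial.
rhs = h_x - np.log(abs(np.linalg.det(np.linalg.inv(M))))
print(h_y, rhs)
```

Since $|\det M| = 0.2 < 1$ in this example, the map shrinks volumes and the entropy of $Y$ is smaller than that of $X$.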


The next consequence states that the entropy is invariant under rigid
transforms of the plane. It follows from the fact that the determinant of the
Jacobian of these transforms is equal to $\pm 1$.

Corollary 12.1.3 Let $X$ and $Y$ be random variables on $\mathbb{R}^2$. If $f : \mathbb{R}^2 \to \mathbb{R}^2$
is a rotation, a reflection about a line, or a translation by a vector, and $Y = f(X)$, then
$H(X) = H(Y)$.

In order to assess the change in the entropy we need to study the
expectation term $\mathbb{E}^{p_Y}\big[\ln|\det J_{f^{-1}}(Y)|\big]$. We shall first introduce a few notions.

Definition 12.1.4 A mapping $f : \mathbb{R}^n \to \mathbb{R}^n$ is called a $\lambda$-contraction if there
is a number $0 < \lambda < 1$ such that for any compact domain $K \subset \mathbb{R}^n$ we
have $\mathrm{vol}(f(K)) \le \lambda\,\mathrm{vol}(K)$, where "vol" denotes the $n$-dimensional Lebesgue
measure.

















Contractions often occur in neural networks when describing information
compression, such as pooling, when pixels in a given square region are
replaced by the pixel of maximum value. The geometric interpretation of the
determinant of the Jacobian matrix in terms of contractions is given in the
following result.


Proposition 12.1.5 Let $f$ be a mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$ and $0 < \lambda < 1$. The
following are equivalent:

(i) $|\det J_f(x)| \le \lambda$ for all $x \in \mathbb{R}^n$;

(ii) $f$ is a $\lambda$-contraction.

Proof: (i) $\Rightarrow$ (ii) It follows from an application of the change of variable
formula

$$\mathrm{vol}(f(K)) = \int_{f(K)} dy = \int_K |\det J_f(x)|\,dx \le \lambda \int_K dx = \lambda\,\mathrm{vol}(K).$$

(ii) $\Rightarrow$ (i) Assume there is an $x_0$ such that $|\det J_f(x_0)| > \lambda$. By continuity,
the inequality can be extended to a neighborhood of $x_0$. Let $K$ be a
compact set of positive volume included in this neighborhood, so $|\det J_f(x)| > \lambda$, for all $x \in K$.
Then

$$\mathrm{vol}(f(K)) = \int_{f(K)} dy = \int_K |\det J_f(x)|\,dx > \lambda \int_K dx = \lambda\,\mathrm{vol}(K),$$

which contradicts the definition of the $\lambda$-contraction. Therefore, the
assumption was false and hence $|\det J_f(x)| \le \lambda$ everywhere. ■

The next proposition states that under a $\lambda$-contraction the entropy
decreases.


Proposition 12.1.6 (Entropy change) Let $X$ and $Y$ be given as in
Proposition 12.1.2, and assume the mapping $f$ is a $\lambda$-contraction, with $\lambda \in (0, 1)$.
Then the entropy change, $H(X) - H(Y)$, is positive. A lower bound is given
by

$$H(X) - H(Y) \ge \ln\frac{1}{\lambda}.$$

Proof: Using Proposition 12.1.2 we have

$$H(X) - H(Y) = \mathbb{E}^{p_Y}\big[\ln|\det J_{f^{-1}}(Y)|\big].$$

Using Proposition 12.1.5 and flipping the inequality $0 < |\det J_f(x)| \le \lambda < 1$
yields $|\det J_{f^{-1}}(y)| \ge \frac{1}{\lambda} > 1$, and hence $\ln|\det J_{f^{-1}}(y)| > 0$. Therefore,
$\mathbb{E}^{p_Y}\big[\ln|\det J_{f^{-1}}(Y)|\big] > 0$. More precisely,

$$\mathbb{E}^{p_Y}\big[\ln|\det J_{f^{-1}}(Y)|\big] = \int p_Y(y) \ln|\det J_{f^{-1}}(y)|\,dy \ge \ln\frac{1}{\lambda} \int p_Y(y)\,dy = \ln\frac{1}{\lambda}.$$

■


A positive entropy change corresponds to a loss of information that occurred
during this transformation. The previous proposition states that contraction
mappings always cause information loss (the smaller the $\lambda$, the larger the
loss). An example of this type of mapping is max-pooling. The concept
of entropy change will be further used when analyzing the entropy flow in
neural networks.
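A simple worked case, chosen here for illustration, is the uniform scaling $f(x) = cx$ on $\mathbb{R}^n$ with $0 < c < 1$. Its Jacobian is $J_f = c\,I_n$, so $|\det J_f(x)| = c^n$ and $\mathrm{vol}(f(K)) = c^n\,\mathrm{vol}(K)$; hence $f$ is a $\lambda$-contraction with $\lambda = c^n$. Since $|\det J_{f^{-1}}(y)| = c^{-n}$ is constant, Proposition 12.1.2 gives

$$H(X) - H(Y) = \mathbb{E}^{p_Y}\big[\ln c^{-n}\big] = n \ln\frac{1}{c} = \ln\frac{1}{\lambda},$$

so the lower bound of Proposition 12.1.6 is attained in this case.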


12.2 Entropy Flow 


We consider a feedforward neural network with $L$ layers, whose activations
are given by the random vector variables $X^{(\ell)}$, $0 \le \ell \le L$. The information
field $\mathcal{I}^{(\ell)} = \mathfrak{S}(X^{(\ell)})$ can be assessed numerically using the entropy function
of the $\ell$th layer activation, $H(X^{(\ell)})$. Assume the layers $\ell - 1$ and $\ell$ have the
same number of neurons, $d^{(\ell-1)} = d^{(\ell)}$, and that $X^{(\ell)} = f(X^{(\ell-1)})$, with $f$ a
deterministic smooth function. If we let $\Delta^{(\ell)} = \det J_f(X^{(\ell-1)})$, then Proposition
12.1.2 provides

$$H(X^{(\ell)}) = H(X^{(\ell-1)}) + \mathbb{E}^{P_{X^{(\ell)}}}\big[\ln|\Delta^{(\ell)}|\big], \qquad 1 \le \ell \le L. \tag{12.2.4}$$


Definition 12.2.1 The entropy flow associated with a feedforward neural
network with layer activations $X^{(\ell)}$ is the sequence $\{H(X^{(\ell)})\}_{0 \le \ell \le L}$ of
entropies of the network layer activations.


If the feedforward network performs a classification task, the entropy flow
is expected to decrease to $\log_2 c$ bits, where $c$ is the number of classes. This
corresponds to the network's role of information organization and reduction of
uncertainty. For example, if the network has to classify pictures with animals
into mammals and non-mammals, the entropy of the last layer is $H(X^{(L)}) = 1$,
even if the entropy of the input picture, $H(X^{(0)})$, is a lot larger.

The decrease in entropy is done gradually from layer to layer. For instance, 
in the case of a convolutional network applied on car images, the first layer 
determines small corners and edges; the next layer organizes the information 
obtained in the previous layer into some small parts, like wheel, window, 
bumper, etc. A later layer might be able to classify this information into 
higher level features, such as the car type, which has the smallest entropy. 

The following concept is useful in the study of the entropy flow behavior.
We define the entropy leak between the layers $\ell - 1$ and $\ell$ by the change in
the entropies of the layer activations

$$\Lambda(\ell - 1, \ell) = H(X^{(\ell-1)}) - H(X^{(\ell)}). \tag{12.2.5}$$






The next result states sufficient conditions for the information flow to be
decreasing. This corresponds to a sequence of compressions of the information
through the network.

Theorem 12.2.2 Assume the following conditions are satisfied in a
feedforward neural network:

(i) The activation function is increasing, with $0 < \phi'(x) < 1$;

(ii) There is the same number of incoming and outgoing variables for the $\ell$th
layer, i.e., $d^{(\ell-1)} = d^{(\ell)}$;

(iii) $0 < |\det W^{(\ell)}| < 1$.

Then there is a positive entropy leak between the layers $\ell - 1$ and $\ell$, i.e.,

$$\Lambda(\ell - 1, \ell) > 0.$$


Proof: Conditions (i) and (iii) substituted into (11.8.7) imply $0 < |\Delta^{(\ell)}| < 1$.
Using (12.2.4) and (12.2.5) we have

$$\Lambda(\ell - 1, \ell) = -\mathbb{E}^{P_{X^{(\ell)}}}\big[\ln|\Delta^{(\ell)}|\big] > 0.$$


Remark 12.2.3 (i) If we assume $|\det W^{(\ell)}| < \lambda < 1$, then the following
lower bound for the entropy leak holds:

$$\Lambda(\ell - 1, \ell) > -\ln\lambda > 0.$$


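(ii) In the case when the activation function is the logistic function, $\phi(x) =
\sigma(x)$, following Remark 11.8.1, we have the following explicit computation:

$$\begin{aligned}
\ln|\Delta^{(\ell)}| &= \sum_i \ln\sigma(s_i^{(\ell)}) + \sum_i \ln\big(1 - \sigma(s_i^{(\ell)})\big) + \ln|\det W^{(\ell)}| \\
&= -\sum_i \ln\big(1 + e^{-s_i^{(\ell)}}\big) - \sum_i \ln\big(1 + e^{s_i^{(\ell)}}\big) + \ln|\det W^{(\ell)}| \\
&= -\sum_i sp(-s_i^{(\ell)}) - \sum_i sp(s_i^{(\ell)}) + \ln|\det W^{(\ell)}| \\
&= -2\sum_i sp(s_i^{(\ell)}) + \sum_i s_i^{(\ell)} + \ln|\det W^{(\ell)}|,
\end{aligned}$$

where we used the softplus properties $sp(x) = \ln(1 + e^x)$ and $sp(x) - sp(-x) = x$
introduced in Chapter 2.

Furthermore, if there is a matrix $A$ such that $W^{(\ell)} = e^{A}$, with positive
determinant, then the last term becomes, by virtue of Liouville's formula,
$\ln\det W^{(\ell)} = \operatorname{Trace} A$.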
















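(iii) If $W^{(\ell)}$ is an orthonormal matrix, then the linear part of the $\ell$th layer
corresponds to a rigid transformation (i.e., a rotation composed with a
translation induced by the bias vector $B^{(\ell)}$). Then $\ln|\det W^{(\ell)}| = 0$, and hence the
weights cause no information loss in this case; for a linear activation, $\phi' = 1$,
this gives $\Lambda(\ell - 1, \ell) = 0$.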
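The entropy leak of a logistic layer can be estimated by Monte Carlo from the formula in part (ii), since $\ln\sigma'(s) = s - 2\,sp(s)$. The sketch below assumes a standard normal distribution for the incoming activation and an illustrative $2 \times 2$ weight matrix; these choices and the helper names are assumptions, not values from the text.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)           # ln(1 + e^x), computed stably


def log_abs_delta(x_prev, W, B):
    """ln|Delta| per sample for a logistic layer, using ln sigma'(s) = s - 2 sp(s)."""
    s = x_prev @ W - B                    # each row is (W^T x - B)^T for one sample x
    _, logabsdet = np.linalg.slogdet(W)
    return np.sum(s - 2.0 * softplus(s), axis=1) + logabsdet


rng = np.random.default_rng(0)
W = np.array([[0.9, 0.2],
              [-0.1, 0.8]])               # assumed weights, |det W| = 0.74 < 1
B = np.array([0.1, -0.3])
x_prev = rng.normal(size=(10_000, 2))     # assumed distribution of X^(l-1)

leak = -np.mean(log_abs_delta(x_prev, W, B))   # estimate of the entropy leak (12.2.5)
print(leak, -np.log(abs(np.linalg.det(W))))
```

Because the logistic derivative never exceeds $1/4$, the estimated leak stays above $-\ln|\det W^{(\ell)}|$, consistent with Theorem 12.2.2 and with the bound in part (i).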


12.3 The Input Layer Entropy 


Consider a feedforward neural network with the input $X = X^{(0)}$, having
components $(x_1^{(0)}, \dots, x_n^{(0)})$. The entropy of the input, $H(X^{(0)})$, depends on
whether the input components are independent or not. In the case when
the input variables $x_j^{(0)}$ are independent, the entropy of the input random
variable is the sum of entropies of individual components.

Proposition 12.3.1 Let $X = (X_1, \dots, X_n)^T$ be an $n$-dimensional random
variable with independent components. Then

$$H(X) = \sum_{j=1}^{n} H(X_j).$$

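Proof: Since the independence implies $p_X(x) = p_{X_1}(x_1) \cdots p_{X_n}(x_n)$, using
Fubini's theorem, we have

$$\begin{aligned}
H(X) &= -\int_{\mathbb{R}^n} p_X(x) \ln p_X(x)\,dx = -\sum_j \int_{\mathbb{R}^n} p_X(x) \ln p_{X_j}(x_j)\,dx \\
&= -\sum_j \int_{\mathbb{R}^n} p_{X_1}(x_1) \cdots p_{X_n}(x_n) \ln p_{X_j}(x_j)\,dx \\
&= -\sum_j \left(\int_{\mathbb{R}} p_{X_j}(x_j) \ln p_{X_j}(x_j)\,dx_j\right) \prod_{i \neq j} \int_{\mathbb{R}} p_{X_i}(x_i)\,dx_i \\
&= -\sum_j \int_{\mathbb{R}} p_{X_j}(x_j) \ln p_{X_j}(x_j)\,dx_j \\
&= \sum_j H(X_j).
\end{aligned}$$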


Remark 12.3.2 If the components $X_j$ are not independent, the incoming information is smaller, and hence we have the inequality $H(X) < \sum_{j=1}^n H(X_j)$.
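The additivity in Proposition 12.3.1 and the inequality in Remark 12.3.2 can be checked numerically on a discrete analog. The sketch below is only an illustration: it uses Shannon entropy of finite probability tables instead of differential entropy, and both tables are hypothetical values chosen for the example.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def joint_entropy(joint):
    """Entropy of a joint distribution given as a list of rows."""
    return entropy([p for row in joint for p in row])

def marginals(joint):
    """Marginal distributions of the row variable and of the column variable."""
    return [sum(row) for row in joint], [sum(col) for col in zip(*joint)]

# Independent components: the joint table is the outer product of the marginals.
p1, p2 = [0.2, 0.8], [0.5, 0.3, 0.2]
joint_indep = [[a * b for b in p2] for a in p1]

# Dependent components (hypothetical table, chosen only for illustration).
joint_dep = [[0.4, 0.1], [0.1, 0.4]]
q1, q2 = marginals(joint_dep)

print(joint_entropy(joint_indep), entropy(p1) + entropy(p2))  # the two values coincide
print(joint_entropy(joint_dep), entropy(q1) + entropy(q2))    # the first value is smaller
```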

Assume the components of the input are not independent. Then the entropy of the input, $H(X^{(0)})$, has an upper bound given in terms of the determinant of its covariance matrix. This will be shown next.





Proposition 12.3.3 Let $X = (X_1,\dots,X_n)^T$ be an n-dimensional random variable with covariance matrix $A$. Assume $\det A > 0$. Then

$$H(X) \le \frac{1}{2}\ln[(2\pi e)^n \det A].$$

Proof: Let $p(x)$ be the density of the random variable $X = (X_1,\dots,X_n)^T$, with $x = (x_1,\dots,x_n)^T \in \mathbb{R}^n$. The covariance matrix of $X$ is the $n\times n$ matrix $A = (A_{ij})_{i,j}$, with $A_{ij} = \mathbb{E}[(X_i-\mu_i)(X_j-\mu_j)]$, where $\mathbb{E}[X] = \mu = (\mu_1,\dots,\mu_n)^T$ is the mean vector of $X$. By Proposition 3.5.1 we have the inequality

$$H(p) \le S(p,q), \tag{12.3.6}$$

for any density function $q(x)$ on $\mathbb{R}^n$. In particular, the inequality holds for the multivariate normal distribution

$$q(x) = \frac{1}{\sqrt{(2\pi)^n \det A}}\, e^{-\frac{1}{2}(x-\mu)^T A^{-1}(x-\mu)}.$$

It suffices to compute the cross-entropy $S(p,q)$. In the following we shall use upper indices to denote the entries of the inverse of the covariance matrix, $A^{-1} = (A^{ij})_{i,j}$. Since the negative log-likelihood of $q$ is given by

$$-\ln q(x) = \frac{1}{2}\ln[(2\pi)^n\det A] + \frac{1}{2}(x-\mu)^T A^{-1}(x-\mu),$$

then

$$
\begin{aligned}
S(p,q) &= \mathbb{E}^p[-\ln q] = \frac{1}{2}\ln[(2\pi)^n\det A] + \frac{1}{2}\mathbb{E}^p[(x-\mu)^T A^{-1}(x-\mu)] \\
&= \frac{1}{2}\ln[(2\pi)^n\det A] + \frac{1}{2}\mathbb{E}^p\Big[\sum_{i,j=1}^n A^{ij}(x_i-\mu_i)(x_j-\mu_j)\Big] \\
&= \frac{1}{2}\ln[(2\pi)^n\det A] + \frac{1}{2}\sum_{i,j=1}^n A^{ij}\,\mathbb{E}^p[(x_i-\mu_i)(x_j-\mu_j)] \\
&= \frac{1}{2}\ln[(2\pi)^n\det A] + \frac{1}{2}\sum_{i,j=1}^n A^{ij}A_{ij} \\
&= \frac{1}{2}\ln[(2\pi)^n\det A] + \frac{1}{2}\sum_{i=1}^n (A^{-1}A)_{ii} = \frac{1}{2}\ln[(2\pi)^n\det A] + \frac{n}{2} \\
&= \frac{1}{2}\ln[(2\pi e)^n\det A],
\end{aligned}
$$

where the symmetry of $A$ was used in the next-to-last line. Using now inequality (12.3.6) yields the desired relation

$$H(X) = H(p) \le \frac{1}{2}\ln[(2\pi e)^n\det A].$$









Note that the identity is achieved when $X$ is a multivariate normal random variable with covariance matrix $A$. This means that among all random variables of given covariance matrix, the one with the maximal entropy is the multivariate normal. ■
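A quick numerical look at the bound for $n = 1$, where $\det A$ reduces to the variance. The sketch relies on the standard closed-form differential entropies and variances of a few one-dimensional distributions; the choice of distributions and parameters is an assumption made only for illustration.

```python
import math

def gaussian_bound(variance):
    """Right-hand side of the bound for n = 1: (1/2) ln(2*pi*e*variance), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

# name -> (differential entropy in nats, variance), from standard closed forms.
cases = {
    "uniform on [0, 1]": (0.0, 1.0 / 12.0),
    "exponential, rate 1": (1.0, 1.0),
    "Laplace, scale 1": (1.0 + math.log(2.0), 2.0),
    "normal, variance 1": (0.5 * math.log(2 * math.pi * math.e), 1.0),
}

for name, (h, var) in cases.items():
    # Every entropy stays below the bound; the normal case attains it.
    print(f"{name}: H = {h:.4f}, bound = {gaussian_bound(var):.4f}")
```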


Corollary 12.3.4 Let $X = (X_1,\dots,X_n)^T$ be an n-dimensional random variable with independent components. Then

$$H(X) \le \frac{1}{2}\ln\Big[(2\pi e)^n \prod_{i=1}^n Var(X_i)\Big].$$

Proof: If $X$ has independent components, then its covariance matrix has a diagonal form

$$A_{ij} = \begin{cases} Var(X_i), & \text{if } i = j \\ 0, & \text{if } i \neq j, \end{cases}$$

and hence $\det A = \prod_{i=1}^n Var(X_i)$. The result follows now from Proposition 12.3.3. ■

As an exemplification, we shall apply next the concept of entropy flow to 
the linear neuron and to the autoencoder. 


12.4 The Linear Neuron 


Let $X = (X_1,\dots,X_n)^T$ and $Y = (Y_1,\dots,Y_n)^T$ be the input and the output vectors for a linear neuron, respectively, see section 5.5. We have assumed that the number of inputs is equal to the number of outputs. Since the activation function is linear, $\phi(x) = x$, the output is related to the input by $Y = f(X) = W^T X - B$, where $W = (w_{ij})_{i,j}$ is the weight matrix and $B = (b_1,\dots,b_n)$ is the bias vector. The Jacobian of the input-output function $f$ is

$$J_f(X) = \Big(\frac{\partial f_j}{\partial x_k}\Big)_{j,k} = W^T.$$

By Proposition 12.1.2 the entropy of the output, $Y$, can be written in terms of the entropy of the input, $X$, as follows:

$$
\begin{aligned}
H(Y) = H(W^T X - B) &= H(X) - \mathbb{E}^{P_Y}[\ln|\det J_{f^{-1}}(Y)|] \\
&= H(X) + \mathbb{E}^{P_Y}[\ln|\det W|] \\
&= H(X) + \ln|\det W|,
\end{aligned}
$$

where we used that $\det W = \det W^T$ and the fact that $W$ is a matrix with constant entries.











By Theorem 11.7.1 the linear neuron is uncompressed if $\det W \neq 0$. In the particular case when $W$ is an orthogonal matrix, $|\det W| = 1$, then there is no entropy leak, i.e., $H(Y) - H(X) = 0$. There is a positive entropy leak in the case $|\det W| < 1$. The condition $|\det W| < 1$ is implied by the use of small weights, see Exercise 12.13.19. Hence, the use of the regularization condition $\|W\|^2 < \alpha$, with $\alpha$ small enough, implies the previous inequality.
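The relation $H(Y) = H(X) + \ln|\det W|$ can be verified in closed form for a normally distributed input, since $Y = W^T X - B$ is then normal with covariance $W^T A W$. The following sketch does this for $n = 2$; the covariance matrix and the two weight matrices are hypothetical values picked for illustration.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose2(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]

def gaussian_entropy(cov):
    """Differential entropy (nats) of a 2-dimensional normal with covariance cov."""
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det2(cov))

def output_covariance(W, A):
    """Covariance of Y = W^T X - B when X has covariance A (the bias B drops out)."""
    return matmul2(matmul2(transpose2(W), A), W)

A = [[2.0, 0.5], [0.5, 1.0]]             # hypothetical input covariance
W_small = [[0.5, 0.2], [-0.1, 0.4]]      # small weights, |det W| = 0.22 < 1
t = 0.3
W_rot = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]  # orthogonal matrix

for W in (W_small, W_rot):
    change = gaussian_entropy(output_covariance(W, A)) - gaussian_entropy(A)
    print(change, math.log(abs(det2(W))))  # the two numbers coincide
```

With the small weights the entropy change is negative, i.e., there is a positive entropy leak, while the rotation leaves the entropy unchanged.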

If one considers a neural network made of linear neurons with layers of the same dimension, then

$$H(X^{(\ell)}) = H(X^{(\ell-1)}) + \ln|\det W^{(\ell)}|, \qquad \ell = 1,\dots,L.$$

Using the properties of logarithms, we obtain the relation

$$H(X^{(L)}) = H(X^{(0)}) + \ln\big|\det[W^{(1)}W^{(2)}\cdots W^{(L)}]\big|. \tag{12.4.7}$$

Using (12.4.7) and Proposition 12.3.3 we obtain an upper bound for the entropy of the output layer of the neural network as

$$H(X^{(L)}) \le \frac{1}{2}\ln[(2\pi e)^n\det A] + \ln\big|\det[W^{(1)}W^{(2)}\cdots W^{(L)}]\big|. \tag{12.4.8}$$


12.5 The Autoencoder 


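We consider a feedforward neural network with 3 inputs, 3 outputs and one hidden layer having a single neuron, see Fig. 12.1. The first part of the network, including the neuron in the middle, encodes the input signal into

$$x_1^{(1)} = \phi(s_1^{(1)}) = \phi\big(w_{11}^{(1)}x_1^{(0)} + w_{21}^{(1)}x_2^{(0)} + w_{31}^{(1)}x_3^{(0)} - b_1^{(1)}\big),$$

while the second part decodes the signal into the output vector

$$X^{(2)} = \begin{pmatrix} x_1^{(2)} \\ x_2^{(2)} \\ x_3^{(2)} \end{pmatrix} = \phi(s^{(2)}) = \begin{pmatrix} \phi(w_{11}^{(2)}x_1^{(1)} - b_1^{(2)}) \\ \phi(w_{12}^{(2)}x_1^{(1)} - b_2^{(2)}) \\ \phi(w_{13}^{(2)}x_1^{(1)} - b_3^{(2)}) \end{pmatrix}.$$

This very simple neural net is an example of an autoencoder. Since the input and output dimensions are equal, $d^{(0)} = d^{(2)} = 3$, it makes sense to apply the information tools developed earlier. Denote the activation function in each neuron by $\phi$.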

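The sensitivity of the $i$th output with respect to the $j$th input is given by

$$\frac{\partial x_i^{(2)}}{\partial x_j^{(0)}} = \phi'(s_i^{(2)})\frac{\partial s_i^{(2)}}{\partial x_j^{(0)}} = \phi'(s_i^{(2)})\,w_{1i}^{(2)}\frac{\partial x_1^{(1)}}{\partial x_j^{(0)}} = \phi'(s_i^{(2)})\,\phi'(s_1^{(1)})\,w_{1i}^{(2)}w_{j1}^{(1)}.$$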



























Figure 12.1: Autoencoder with one hidden neuron. 


The determinant of the previous Jacobian can be computed using the determinant multilinearity property on rows and columns as in the following:

$$\det\Big(\frac{\partial x_i^{(2)}}{\partial x_j^{(0)}}\Big) = \phi'(s_1^{(1)})^3\,\phi'(s_1^{(2)})\,\phi'(s_2^{(2)})\,\phi'(s_3^{(2)})\,\det\big(w_{1i}^{(2)}w_{j1}^{(1)}\big).$$

If any of the weights $w_{j1}^{(1)}$ or $w_{1i}^{(2)}$ are zero, then the determinant vanishes, $\det\big(w_{1i}^{(2)}w_{j1}^{(1)}\big) = 0$. Assume now that none of the weights is vanishing. Then

$$
\det\big(w_{1i}^{(2)}w_{j1}^{(1)}\big) =
\begin{vmatrix}
w_{11}^{(2)}w_{11}^{(1)} & w_{11}^{(2)}w_{21}^{(1)} & w_{11}^{(2)}w_{31}^{(1)} \\
w_{12}^{(2)}w_{11}^{(1)} & w_{12}^{(2)}w_{21}^{(1)} & w_{12}^{(2)}w_{31}^{(1)} \\
w_{13}^{(2)}w_{11}^{(1)} & w_{13}^{(2)}w_{21}^{(1)} & w_{13}^{(2)}w_{31}^{(1)}
\end{vmatrix}
= w_{11}^{(1)}w_{21}^{(1)}w_{31}^{(1)}
\begin{vmatrix}
w_{11}^{(2)} & w_{11}^{(2)} & w_{11}^{(2)} \\
w_{12}^{(2)} & w_{12}^{(2)} & w_{12}^{(2)} \\
w_{13}^{(2)} & w_{13}^{(2)} & w_{13}^{(2)}
\end{vmatrix}
= 0,
$$

since the last determinant has identical columns. Therefore, $\det\Big(\frac{\partial x_i^{(2)}}{\partial x_j^{(0)}}\Big) = 0$, which means that the autoencoder is a $\lambda$-contraction for any $\lambda > 0$, see Proposition 12.1.5. The information is compressed from the input to the hidden layer, $\mathcal{X}^{(1)} \subsetneq \mathcal{X}^{(0)}$. The inclusion is strict since we cannot solve for all input components, $x_j^{(0)}$, in terms of the component $x_1^{(1)}$.



















If $w_{1i}^{(2)} \neq 0$, then $\mathrm{rank}\, W^{(2)} = 1$, and by Theorem 11.8.3 the output layer is uncompressed, $\mathcal{X}^{(1)} = \mathcal{X}^{(2)}$. Hence, the autoencoder behaves as an information compressor. The information is compressed in the encoder layer and it is preserved in the decoding layer.
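The vanishing of the Jacobian determinant can be observed numerically. The sketch below builds the sensitivities $\partial x_i^{(2)}/\partial x_j^{(0)}$ of the 3-1-3 autoencoder and evaluates the determinant; the logistic activation, the weights, the biases, and the input point are all assumptions made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def autoencoder_jacobian(x0, w1, b1, w2, b2):
    """Matrix of sensitivities d x_i^(2) / d x_j^(0) for logistic activations."""
    s1 = sum(w * x for w, x in zip(w1, x0)) - b1     # hidden pre-activation
    x1 = sigmoid(s1)                                 # hidden activation
    s2 = [w * x1 - b for w, b in zip(w2, b2)]        # output pre-activations
    return [[sigmoid_prime(s2[i]) * w2[i] * sigmoid_prime(s1) * w1[j]
             for j in range(3)] for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical weights, biases, and input point.
w1, b1 = [0.7, -1.2, 0.4], 0.1
w2, b2 = [1.5, -0.3, 0.8], [0.2, 0.0, -0.5]
J = autoencoder_jacobian([0.3, 0.9, -0.6], w1, b1, w2, b2)
print(det3(J))  # zero up to floating-point rounding
```

The matrix is an outer product of two vectors, so it has rank one even though none of its entries vanish.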

12.6 Conditional Entropy 

Another measure of information between two random variables, $X$ and $Y$, is the conditional entropy, $H(Y|X)$. This evaluates the information contained in $Y$ if the variable $X$ is known.

The formal definition is introduced in the following. First, we shall assume that $X$ and $Y$ are discrete random variables with distributions

$$p(x_i) = P(X = x_i), \quad p(y_j) = P(Y = y_j), \quad i = 1,\dots,N, \; j = 1,\dots,M.$$

The conditional entropy of $Y$ given that $X = x_i$ is defined by

$$H(Y|X = x_i) = -\sum_{j=1}^M p(y_j|x_i)\ln p(y_j|x_i),$$

where $p(y_j|x_i) = P(Y = y_j|X = x_i)$. The conditional entropy of $Y$ given $X$ is the weighted average of $H(Y|X = x_i)$, namely,

$$H(Y|X) = \sum_{i=1}^N p(x_i)H(Y|X = x_i).$$

Using the joint density relation $p(x_i)p(y_j|x_i) = p(x_i,y_j)$, the aforementioned relation becomes

$$H(Y|X) = -\sum_{i=1}^N\sum_{j=1}^M p(x_i,y_j)\ln p(y_j|x_i), \tag{12.6.9}$$

where $p(x_i,y_j) = P(X = x_i, Y = y_j)$ is the joint distribution of $X$ and $Y$. The definition variant for the conditional entropy in case of continuous random variables is

$$H(Y|X) = -\int_{\mathbb{R}}\int_{\mathbb{R}} p(x,y)\ln p(y|x)\,dxdy, \tag{12.6.10}$$

where $p(x,y)$ is the joint probability density of $(X,Y)$, and $p(y|x)$ is the conditional probability density of $Y$ given $X$. The definition can be extended to the multivariate case. For instance,

$$H(Z|X,Y) = -\iiint p(x,y,z)\ln p(z|x,y)\,dxdydz,$$




$$H(X,Y|Z) = -\iiint p(x,y,z)\ln p(x,y|z)\,dxdydz.$$

The first relation represents the conditional entropy of $Z$, given variables $X$ and $Y$, while the latter is the conditional entropy of $X$ and $Y$, given the variable $Z$.
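For discrete variables, the two expressions of the conditional entropy (the weighted average and the joint form (12.6.9)) can be compared directly. The sketch below is a minimal illustration; the joint tables are hypothetical, and the second one has a single nonzero entry in each row, so that $Y$ is a function of $X$.

```python
import math

def conditional_entropy(joint):
    """H(Y|X) in nats from joint[i][j] = p(x_i, y_j), using the joint form (12.6.9)."""
    total = 0.0
    for row in joint:
        px = sum(row)                                # marginal p(x_i)
        for pxy in row:
            if pxy > 0:
                total -= pxy * math.log(pxy / px)    # p(y_j|x_i) = p(x_i, y_j) / p(x_i)
    return total

def conditional_entropy_weighted(joint):
    """H(Y|X) as the p(x_i)-weighted average of H(Y | X = x_i)."""
    total = 0.0
    for row in joint:
        px = sum(row)
        if px > 0:
            h_row = -sum((p / px) * math.log(p / px) for p in row if p > 0)
            total += px * h_row
    return total

joint = [[0.25, 0.25], [0.40, 0.10]]         # hypothetical joint distribution
deterministic = [[0.3, 0.0], [0.0, 0.7]]     # each x_i determines y_j

print(conditional_entropy(joint), conditional_entropy_weighted(joint))  # equal values
print(conditional_entropy(deterministic))                               # no uncertainty left
```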

Proposition 12.6.1 Let $X$ and $Y$ be two random variables, such that $Y$ depends deterministically on $X$, i.e., there is a function $f$ such that $Y = f(X)$. Then $H(Y|X) = 0$.

Proof: First we note that the entropy of a constant is equal to zero, $H(c) = 0$. This follows from the definition of the entropy as

$$H(c) = -\sum_i p_i\ln p_i = -1\ln 1 = 0.$$

The entropy of $Y$ conditioned by the event $\{X = x_i\}$ is

$$H(Y|X = x_i) = H(f(X)|X = x_i) = H(f(x_i)) = 0,$$

by the previous observation. Then

$$H(Y|X) = \sum_i p(x_i)H(Y|X = x_i) = 0.$$

The proof was done for discrete random variables, but with small changes it can also accommodate continuous random variables. ■


12.7 The Mutual Information 

Heuristically, if $X$ and $Y$ are independent random variables, then $H(Y|X) = H(Y)$, i.e., the knowledge of $X$ does not affect the information contained in $Y$. However, if $X$ and $Y$ are not independent, it will be shown that $H(Y|X) < H(Y)$, i.e., conditioning a random variable decreases its information. The difference

$$I(Y|X) = H(Y) - H(Y|X) \tag{12.7.11}$$

represents the amount of information conveyed by X about Y. Examples 
include the information that face images provide about the names of the 
people portrayed, or the information that speech sounds provide about the 
words spoken. 

This concept will be used later in the study of compression of information 
through the layers of a neural network. A few basic properties are presented 
in the following: 





Proposition 12.7.1 For any two random variables $X$ and $Y$ defined on the same sample space we have:

(a) Nonnegativity: $I(X|Y) \ge 0$;

(b) Nondegeneracy: $I(X|Y) = 0 \Leftrightarrow X$ and $Y$ are independent.

(c) Symmetry: $I(X|Y) = I(Y|X)$.

Proof: (a) Using the definition relation (12.7.11), it suffices to show that $H(Y) - H(Y|X) \ge 0$. We have

$$
\begin{aligned}
H(Y) - H(Y|X) &= -\int p(y)\ln p(y)\,dy + \iint p(x,y)\ln p(y|x)\,dxdy \\
&= -\iint p(x,y)\ln p(y)\,dxdy + \iint p(x,y)\ln p(y|x)\,dxdy \\
&= \iint p(x,y)\ln\frac{p(y|x)}{p(y)}\,dxdy = \iint p(x,y)\ln\frac{p(x,y)}{p(x)p(y)}\,dxdy \\
&= D_{KL}[p(x,y)\,\|\,p(x)p(y)] \ge 0,
\end{aligned}
$$

where we used the nonnegativity property of the Kullback-Leibler divergence $D_{KL}$, see section 3.6.

(b) Using the computation from part (a), we have

$$I(X|Y) = 0 \Leftrightarrow H(Y) - H(Y|X) = 0 \Leftrightarrow D_{KL}[p(x,y)\,\|\,p(x)p(y)] = 0.$$

This occurs for the case $\ln\frac{p(x,y)}{p(x)p(y)} = 0$, namely, for $p(x,y) = p(x)p(y)$, which means that $X$ and $Y$ are independent random variables.

(c) The computation in part (a) shows that $I(X|Y) = D_{KL}[p(x,y)\,\|\,p(x)p(y)]$. Since the expression on the right side is symmetric in $x$ and $y$, it follows that $I(X|Y) = I(Y|X)$. ■

The symmetry property enables us to write just $I(X,Y)$ instead of $I(X|Y)$ or $I(Y|X)$; we shall call it the mutual information of $X$ and $Y$. This means that the amount of information contained in $X$ about $Y$ is the same as the amount of information carried in $Y$ about $X$.


Corollary 12.7.2 We have the following equivalent definitions for the mutual information:

$$
\begin{aligned}
I(X,Y) &= H(Y) - H(Y|X) \\
&= H(X) - H(X|Y) \\
&= D_{KL}[p(x,y)\,\|\,p(x)p(y)].
\end{aligned}
$$

The mutual information can be also seen as the information by which the sum of separate information of $X$ and $Y$ exceeds the joint information of $(X,Y)$.







Proposition 12.7.3 The mutual information is given by

$$I(X,Y) = H(X) + H(Y) - H(X,Y). \tag{12.7.12}$$

Proof: Using that $\ln p(y|x) = \ln\frac{p(x,y)}{p(x)} = \ln p(x,y) - \ln p(x)$, we have

$$
\begin{aligned}
I(X,Y) &= H(Y) - H(Y|X) = H(Y) + \iint p(x,y)\ln p(y|x)\,dxdy \\
&= H(Y) + \iint p(x,y)\ln p(x,y)\,dxdy - \iint p(x,y)\ln p(x)\,dxdy \\
&= H(Y) - H(X,Y) - \int p(x)\ln p(x)\,dx \\
&= H(Y) - H(X,Y) + H(X).
\end{aligned}
$$

Since the mutual information is nonnegative, relation (12.7.12) implies that $H(X,Y) \le H(X) + H(Y)$. ■
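The equivalent expressions of the mutual information can be compared on a small discrete example. This is a sketch under the assumption of finite joint tables (sums replace the integrals); both tables are hypothetical, the second being a product of its marginals.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information_forms(joint):
    """Return I(X,Y) computed in three ways from joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy([p for row in joint for p in row])
    h_y_given_x = -sum(p * math.log(p / px[i])
                       for i, row in enumerate(joint) for p in row if p > 0)
    kl = sum(p * math.log(p / (px[i] * py[j]))
             for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)
    # H(Y) - H(Y|X),  H(X) + H(Y) - H(X,Y),  D_KL[p(x,y) || p(x)p(y)]
    return entropy(py) - h_y_given_x, entropy(px) + entropy(py) - h_xy, kl

joint = [[0.30, 0.10], [0.15, 0.45]]          # hypothetical joint distribution
independent = [[0.12, 0.28], [0.18, 0.42]]    # outer product of (0.4, 0.6) and (0.3, 0.7)

print(mutual_information_forms(joint))         # three equal positive numbers
print(mutual_information_forms(independent))   # three numbers equal to zero up to rounding
```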


Remark 12.7.4 There is an interesting similarity between properties of entropy and properties of area. The area of intersection of two domains $A$ and $B$, which is given by

$$area(A\cap B) = area(A) + area(B) - area(A\cup B) = area(A) - area(A\setminus B),$$

is the analog of

$$I(X,Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y).$$

This forms the basis of reasoning with entropies as if they were areas, like in Fig. 12.2. Thus, random variables can be represented as domains whose area corresponds to the information contained in the random variable. The area contained in domain $X$, which is not contained in domain $Y$, represents the conditional entropy, $H(X|Y)$, while the common area of domains $X$ and $Y$ is the mutual information $I(X,Y)$.

It is worth noting that the mutual information is not a distance on the 
space of random variables, since the triangle inequality does not hold. 

One of the most important properties of the mutual information, which is relevant for deep neural networks, is the invariance property to invertible transformations, which is contained in the next result.

Proposition 12.7.5 (Invariance property) Let $X$ and $Y$ be random variables taking values in $\mathbb{R}^n$. Then for any invertible and differentiable transformations $\phi, \psi : \mathbb{R}^n \to \mathbb{R}^n$, we have

$$I(X,Y) = I(\phi(X),\psi(Y)). \tag{12.7.13}$$



Figure 12.2: Each random variable is represented by a domain; the area of intersection of domains represents the mutual information $I(X,Y)$.


Proof: We shall use the formula of entropy change under a coordinate transformation given by Proposition 12.1.2. Let $X' = \phi(X)$ and $Y' = \psi(Y)$, and denote $F = (\phi,\psi) : \mathbb{R}^n\times\mathbb{R}^n \to \mathbb{R}^n\times\mathbb{R}^n$, i.e., $(x',y') = F(x,y) = (\phi(x),\psi(y))$. It follows that

$$J_F(x,y) = \frac{\partial(x',y')}{\partial(x,y)} = \begin{pmatrix} J_\phi(x) & 0 \\ 0 & J_\psi(y) \end{pmatrix},$$

and hence $\det J_F(x,y) = \det J_\phi(x)\det J_\psi(y)$. Inverting, we obtain a similar relation

$$\det J_{F^{-1}}(x',y') = \det J_{\phi^{-1}}(x')\det J_{\psi^{-1}}(y').$$

Taking the log function yields

$$\ln|\det J_{F^{-1}}(x',y')| = \ln|\det J_{\phi^{-1}}(x')| + \ln|\det J_{\psi^{-1}}(y')|. \tag{12.7.14}$$

We then consider the expectation with respect to the joint distribution of $(X',Y')$ and use formula (12.7.14):

$$
\begin{aligned}
\mathbb{E}^{P_{X'Y'}}[\ln|\det J_{F^{-1}}(X',Y')|] &= \iint p(x',y')\ln|\det J_{F^{-1}}(x',y')|\,dx'dy' \\
&= \iint p(x',y')\ln|\det J_{\phi^{-1}}(x')|\,dx'dy' + \iint p(x',y')\ln|\det J_{\psi^{-1}}(y')|\,dx'dy' \\
&= \int p(x')\ln|\det J_{\phi^{-1}}(x')|\,dx' + \int p(y')\ln|\det J_{\psi^{-1}}(y')|\,dy' \\
&= \mathbb{E}^{P_{X'}}[\ln|\det J_{\phi^{-1}}(X')|] + \mathbb{E}^{P_{Y'}}[\ln|\det J_{\psi^{-1}}(Y')|].
\end{aligned}
\tag{12.7.15}
$$

Using Proposition 12.1.2 and formula (12.7.15), we obtain

$$
\begin{aligned}
I(\phi(X),\psi(Y)) &= I(X',Y') = H(X') + H(Y') - H(X',Y') \\
&= H(X) - \mathbb{E}^{P_{X'}}[\ln|\det J_{\phi^{-1}}(X')|] \\
&\quad + H(Y) - \mathbb{E}^{P_{Y'}}[\ln|\det J_{\psi^{-1}}(Y')|] \\
&\quad - H(X,Y) + \mathbb{E}^{P_{X'Y'}}[\ln|\det J_{F^{-1}}(X',Y')|] \\
&= H(X) + H(Y) - H(X,Y) \\
&= I(X,Y),
\end{aligned}
$$

which is the desired result. ■
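The invariance can be observed in the simplest case of a bivariate normal pair transformed by invertible affine maps $X' = aX + c$, $Y' = bY + d$. The sketch below uses the closed-form normal entropies; the covariance entries and the scaling factors are hypothetical, and restricting to affine maps is an assumption that keeps the transformed pair normal.

```python
import math

def gaussian_mi(var_x, var_y, cov_xy):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) in nats for a bivariate normal pair."""
    two_pi_e = 2 * math.pi * math.e
    h_x = 0.5 * math.log(two_pi_e * var_x)
    h_y = 0.5 * math.log(two_pi_e * var_y)
    h_xy = 0.5 * math.log(two_pi_e ** 2 * (var_x * var_y - cov_xy ** 2))
    return h_x + h_y - h_xy

def transformed_mi(var_x, var_y, cov_xy, a, b):
    """Mutual information of X' = a*X + c and Y' = b*Y + d, with a and b nonzero."""
    # Shifts c, d do not affect covariances; scalings multiply them by a*a, b*b, a*b.
    return gaussian_mi(a * a * var_x, b * b * var_y, a * b * cov_xy)

var_x, var_y, cov_xy = 2.0, 1.0, 0.8     # hypothetical covariance entries
print(gaussian_mi(var_x, var_y, cov_xy))
print(transformed_mi(var_x, var_y, cov_xy, a=3.0, b=-0.25))  # same value
```

Each marginal entropy changes by $\ln|a|$ or $\ln|b|$, and the joint entropy changes by $\ln|ab|$, so the changes cancel exactly as in the proof.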


The following result states that the conditional entropy decreases if extra conditions are added:

Lemma 12.7.6 For any three random variables, $X$, $Y$, and $Z$ we have

$$H(X|Y,Z) \le H(X|Z).$$

Proof: Consider the difference $\Delta H = H(X|Z) - H(X|Y,Z)$. Using the definition of entropy and properties of log functions and Kullback-Leibler divergence, we have

$$
\begin{aligned}
\Delta H &= -\iint p(x,z)\ln p(x|z)\,dxdz + \iiint p(x,y,z)\ln p(x|y,z)\,dxdydz \\
&= -\iiint p(x,y,z)\ln p(x|z)\,dxdydz + \iiint p(x,y,z)\ln p(x|y,z)\,dxdydz \\
&= \iiint p(x,y,z)\ln\frac{p(x|y,z)}{p(x|z)}\,dxdydz \\
&= \iiint p(x,y,z)\ln\frac{p(x,y,z)}{p(x|z)p(y,z)}\,dxdydz \\
&= D_{KL}[p(x,y,z)\,\|\,p(x|z)p(y,z)] \ge 0,
\end{aligned}
$$

with identity for $p(x,y,z) = p(x|z)p(y,z)$. Since $p(x,y,z) = p(x|y,z)p(y,z)$, the previous identity is equivalent to $p(x|z) = p(x|y,z)$, which shows a memoryless property of $X$, given $Z$, with respect to $Y$. ■


We say that three random variables $X$, $Y$, and $Z$ form the Markov chain $X \to Y \to Z$, if each variable depends only on the previous variable, see Fig. 12.3. In terms of conditional probability densities, we have $p(Z|X,Y) = p(Z|Y)$, which represents the memoryless property of the chain. The relevance for neural nets is given by the fact that the layer activations of a feedforward neural network form a Markov chain.












Figure 12.3: A Markov chain $X \to Y \to Z$ satisfies the memoryless property $p(Z|X,Y) = p(Z|Y)$.


Lemma 12.7.7 For any three random variables, $X$, $Y$, and $Z$ that form a Markov chain, $X \to Y \to Z$, we have

(a) $H(Z|X,Y) = H(Z|Y)$;

(b) $H(X|Y,Z) = H(X|Y)$.

Proof: (a) Taking the log in the Markov property, $p(z|x,y) = p(z|y)$, and then integrating, yields

$$\iiint p(x,y,z)\ln p(z|x,y)\,dxdydz = \iiint p(x,y,z)\ln p(z|y)\,dxdydz.$$

The left side is equal to $-H(Z|X,Y)$ and the right side to $-H(Z|Y)$, which proves the identity in part (a).

(b) From the Markov property, $p(z|x,y) = p(z|y)$. Then, using the law of conditional probabilities, we obtain

$$p(x|y,z) = \frac{p(x,y,z)}{p(y,z)} = \frac{p(x,y)\,p(z|x,y)}{p(y)\,p(z|y)} = \frac{p(x,y)}{p(y)} = p(x|y).$$

Taking the log and then the expectation in the previous formula and using the definition, we have

$$\iiint p(x,y,z)\ln p(x|y,z)\,dxdydz = \iiint p(x,y,z)\ln p(x|y)\,dxdydz,$$

which is equivalent to the desired identity $H(X|Y,Z) = H(X|Y)$. ■

Remark 12.7.8 Identity (a) shows that the uncertainty of a variable in a Markov chain conditioned by the past is essentially conditioned only by the previous variable (or the nearest past value). Identity (b) states that the uncertainty of a variable in a Markov chain, conditioned by the future, is essentially conditioned by the next variable (or the nearest future value).





Another property of mutual information useful in the context of deep learning is given by the following result, which states that information cannot be increased by data processing. This will be used later in the study of neural nets, where we use the fact that lossy compression cannot convey more information than the original data.

Proposition 12.7.9 (Data processing inequalities) For any three random variables $X$, $Y$, and $Z$, which form a Markov chain, $X \to Y \to Z$, we have:

(a) $I(X,Y) \ge I(X,Z)$;

(b) $I(Y,Z) \ge I(X,Z)$.

Proof: (a) Subtracting the identities

$$I(X,Y) = H(X) - H(X|Y)$$

$$I(X,Z) = H(X) - H(X|Z),$$

and using Lemma 12.7.7, part (b), and then Lemma 12.7.6, yields

$$I(X,Y) - I(X,Z) = H(X|Z) - H(X|Y) = H(X|Z) - H(X|Y,Z) \ge 0.$$

(b) We have

$$I(Y,Z) = H(Z) - H(Z|Y)$$

$$I(X,Z) = H(Z) - H(Z|X).$$

Subtracting, using Lemma 12.7.7, part (a), and then Lemma 12.7.6, yields

$$I(Y,Z) - I(X,Z) = H(Z|X) - H(Z|Y) = H(Z|X) - H(Z|X,Y) \ge 0. \qquad ■$$

The previous result can be restated by saying that in a Markov sequence near variables convey more information than distant variables.
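Both data processing inequalities can be checked on a small discrete Markov chain. In the sketch below the chain is specified by an initial distribution and two transition tables, all of them hypothetical values; the joint distribution is built as $p(x)\,p(y|x)\,p(z|y)$, which is exactly the memoryless assumption.

```python
import math

def mutual_information(joint):
    """I in nats between the row variable and the column variable of a joint table."""
    pr = [sum(row) for row in joint]
    pc = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pr[i] * pc[j]))
               for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)

def markov_joints(px, p_y_given_x, p_z_given_y):
    """Pairwise joint tables (X,Y), (Y,Z), (X,Z) of the chain X -> Y -> Z."""
    nx, ny, nz = len(px), len(p_y_given_x[0]), len(p_z_given_y[0])
    pxy = [[px[i] * p_y_given_x[i][j] for j in range(ny)] for i in range(nx)]
    py = [sum(pxy[i][j] for i in range(nx)) for j in range(ny)]
    pyz = [[py[j] * p_z_given_y[j][k] for k in range(nz)] for j in range(ny)]
    # Summing p(x) p(y|x) p(z|y) over y gives the joint table of (X, Z).
    pxz = [[sum(pxy[i][j] * p_z_given_y[j][k] for j in range(ny)) for k in range(nz)]
           for i in range(nx)]
    return pxy, pyz, pxz

px = [0.3, 0.7]                            # hypothetical distribution of X
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]     # hypothetical transition X -> Y
p_z_given_y = [[0.7, 0.3], [0.4, 0.6]]     # hypothetical transition Y -> Z
pxy, pyz, pxz = markov_joints(px, p_y_given_x, p_z_given_y)
i_xy, i_yz, i_xz = (mutual_information(t) for t in (pxy, pyz, pxz))
print(i_xy, i_yz, i_xz)  # the last value is the smallest
```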

Example 12.7.1 A group of children play the wireless phone game. One 
child whispers in the ear of a second child a word, the second one whispers 
the word to the third one, and so on. The last child says loudly to the group 
what he or she understood. The fun of the game is that the output is quite 
different than the input word. 

Since each member whispers the message fast, this adds uncertainty to the process and fun to the game. This can be modeled as a Markov sequence, since the understanding of each child is conditioned by the previous child's message. The uncertainty of the message is described by the entropy. Lemma 12.7.7 (a) states that the uncertainty of a given message is conditioned only by the previous message. The mutual information, $I(X,Y)$, represents the amount of common information between two whispers $X$ and $Y$. Data processing inequalities (Proposition 12.7.9) state that the amount of common information of two consecutive whispers is larger than the common information of two distant ones.


Example 12.7.2 Consider the Markov chain $X^{(0)} \to X^{(1)} \to X^{(2)}$, where $X^{(0)}$ is the identity of a randomly picked card from a usual 52-card pack, $X^{(1)}$ represents the suit of the card (Spades, Hearts, Clubs, or Diamonds), and $X^{(2)}$ is the color of the card (Black or Red), see Fig. 12.4. We may consider $X^{(0)}$ as the input random variable to a one-hidden layer neural network. The hidden layer $X^{(1)}$ is pooling the suit of the card, while the output $X^{(2)}$ is a pooling layer that collects the color of the suit; $X^{(2)}$ can be also considered as the color classifier of a randomly chosen card.

We shall compute the entropies of each layer using their uniform distributions $P(X^{(0)} = x_i) = 1/52$, as well as

X^(1)     Hearts   Clubs   Spades   Diamonds
P(s_i)    1/4      1/4     1/4      1/4

X^(2)     Red      Black
P(c_i)    1/2      1/2

Using the definition of the entropy, we have

$$H(X^{(0)}) = -\sum_{i=1}^{52} p(x_i)\ln p(x_i) = -\ln(1/52) = \ln 52$$

$$H(X^{(1)}) = -\sum_{i=1}^{4} p(s_i)\ln p(s_i) = -\ln(1/4) = \ln 4$$

$$H(X^{(2)}) = -\sum_{i=1}^{2} p(c_i)\ln p(c_i) = -\ln(1/2) = \ln 2.$$

We notice the strictly decreasing flow of entropy

$$H(X^{(0)}) > H(X^{(1)}) > H(X^{(2)}).$$

Dividing by $\ln 2$, we obtain the entropy measured in bits. This way, $H(X^{(0)})$ is about 5.70 bits, $H(X^{(1)})$ is 2 bits and $H(X^{(2)})$ is just 1 bit. This means that in order to determine a card randomly picked from a deck we need to ask on average about 5.7 questions (with Yes/No answers), while to determine its color it suffices to ask only 1 question (for instance, we may ask: Is the color Red?).

If the card identity, $X^{(0)}$, is revealed, then its suit is determined, and hence, $X^{(1)}$ has no uncertainty, $H(X^{(1)}|X^{(0)}) = 0$. This can be formally shown by noting that

$$p(s_i|x_j) = P(X^{(1)} = s_i\,|\,X^{(0)} = x_j) = \begin{cases} 1, & \text{if card } x_j \text{ has suit } s_i \\ 0, & \text{otherwise,} \end{cases}$$

so that the conditional entropy becomes

$$H(X^{(1)}|X^{(0)}) = -\sum_{j=1}^{52}\sum_{i=1}^{4} p(x_j,s_i)\ln p(s_i|x_j) = -\sum_{j=1}^{52}\sum_{i=1}^{4} p(x_j)\,p(s_i|x_j)\ln p(s_i|x_j) = 0,$$

with the convention $0\ln 0 = 0$.

Similarly, we have $H(X^{(2)}|X^{(0)}) = 0$, since the color of a card is known once the card identity is revealed. We compute next the mutual information between the input layer and the other two layers:

$$I(X^{(0)},X^{(1)}) = H(X^{(1)}) - H(X^{(1)}|X^{(0)}) = H(X^{(1)}) = \ln 4$$

$$I(X^{(0)},X^{(2)}) = H(X^{(2)}) - H(X^{(2)}|X^{(0)}) = H(X^{(2)}) = \ln 2.$$

It follows that the following data processing inequality

$$I(X^{(0)},X^{(1)}) > I(X^{(0)},X^{(2)})$$

is verified strictly.
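The entropies of the three layers of this example can be recomputed by brute force over the deck. The sketch below enumerates the 52 cards and measures entropies in bits; the encoding of cards as (rank, suit) pairs and the helper name `entropy_bits` are choices made for this illustration.

```python
import math
from collections import Counter

SUIT_COLOR = {"Hearts": "Red", "Diamonds": "Red", "Spades": "Black", "Clubs": "Black"}
cards = [(rank, suit) for suit in SUIT_COLOR for rank in range(1, 14)]   # 52 cards

def entropy_bits(values):
    """Entropy in bits of a variable observed on a list of equally likely outcomes."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h0 = entropy_bits(cards)                                       # card identity
h1 = entropy_bits([suit for _, suit in cards])                 # suit
h2 = entropy_bits([SUIT_COLOR[suit] for _, suit in cards])     # color
print(h0, h1, h2)
```

Since the suit and the color are functions of the card identity, the mutual informations with the input layer coincide with `h1` and `h2`, respectively.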


12.8 Applications to Deep Neural Networks 

Before presenting the applications to deep neural networks, we recall some notations. Let $X^{(\ell-1)}$ and $X^{(\ell)}$ be the activations of the layers $\ell-1$ and $\ell$ of a feedforward neural network. $X^{(\ell-1)}$ can be also seen as the input to the $\ell$th layer.

12.8.1 Entropy Flow 

We have studied the entropy flow in the case when layer activations are 
continuous random variables. In this section we shall assume they are discrete. 

Proposition 12.8.1 If the layer activations of a feedforward neural network are discrete random variables, then the entropy flow is decreasing

$$H(X^{(0)}) \ge H(X^{(1)}) \ge \cdots \ge H(X^{(L)}) \ge 0.$$



Figure 12.4: The Markov chain $X^{(0)} \to X^{(1)} \to X^{(2)}$, where $X^{(0)}$ is the identity of a card, $X^{(1)}$ is the suit of a card, and $X^{(2)}$ is the color of a card.


Proof: Since the activation of the $\ell$th layer depends deterministically on the activation of the previous layer, $X^{(\ell)} = F(X^{(\ell-1)})$, by Proposition 12.6.1 we have $H(X^{(\ell)}|X^{(\ell-1)}) = 0$. We shall compute the mutual information $I(X^{(\ell)},X^{(\ell-1)})$ in two different ways:

$$
\begin{aligned}
I(X^{(\ell)},X^{(\ell-1)}) &= H(X^{(\ell)}) - H(X^{(\ell)}|X^{(\ell-1)}) = H(X^{(\ell)}) \\
I(X^{(\ell)},X^{(\ell-1)}) &= H(X^{(\ell-1)}) - H(X^{(\ell-1)}|X^{(\ell)}) \le H(X^{(\ell-1)}),
\end{aligned}
$$

since $H(X^{(\ell-1)}|X^{(\ell)}) \ge 0$, as the variables are discrete. From the last two formulas we infer $H(X^{(\ell)}) \le H(X^{(\ell-1)})$. The equality holds for the case when $F$ is invertible, since

$$H(X^{(\ell-1)}|X^{(\ell)}) = H(F^{-1}(X^{(\ell)})|X^{(\ell)}) = 0,$$

by Proposition 12.6.1. ■


Corollary 12.8.2 Assume the layer activations of a feedforward neural network are discrete random variables and the following conditions are satisfied:

(i) The activation function is strictly increasing;

(ii) The network has constant width: $d^{(0)} = d^{(1)} = \cdots = d^{(L)}$;

(iii) The weight matrices are nondegenerate, $\det W^{(\ell)} \neq 0$, for all $1 \le \ell \le L$.

Then the entropy flow is constant

$$H(X^{(0)}) = H(X^{(1)}) = \cdots = H(X^{(L)}).$$
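The behavior of the entropy flow for discrete activations can be illustrated by pushing a distribution through deterministic maps. In the sketch below a layer is modeled simply as a function applied to the values of the previous layer; the input distribution and the maps are hypothetical and are not meant to represent a trained network.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in nats of a distribution given as a dict value -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def push_forward(dist, f):
    """Distribution of f(X) when X has distribution dist."""
    out = defaultdict(float)
    for value, p in dist.items():
        out[f(value)] += p        # values with the same image merge their probabilities
    return dict(out)

x0 = {v: 0.2 for v in (-2, -1, 0, 1, 2)}          # hypothetical discrete input layer
x1 = push_forward(x0, abs)                        # non-injective layer map
x2 = push_forward(x1, lambda v: v > 0)            # another non-injective map
x1_inv = push_forward(x0, lambda v: 3 * v + 1)    # injective (invertible) map

print(entropy(x0), entropy(x1), entropy(x2))      # nonincreasing sequence
print(entropy(x1_inv))                            # equal to the entropy of x0
```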


12.8.2 Noisy layers 

In all feedforward neural networks used so far the activation of the $\ell$th layer depends deterministically on the activation of the previous layer

$$X^{(\ell)} = F(X^{(\ell-1)}) = \phi(W^{(\ell)T}X^{(\ell-1)} - B^{(\ell)}).$$

In the rest of this chapter we shall assume that the layer activations are perturbed by some noise, $\epsilon^{(\ell)}$, which is a random variable independent of the layer $X^{(\ell-1)}$.

Figure 12.5: If $X^{(\ell-1)}$ and $X^{(\ell)}$ are independent, then $X^{(\ell)}$ is independent of the previous layers of $X^{(\ell-1)}$ and $X^{(\ell-1)}$ is independent of the next layers of $X^{(\ell)}$.

One easy way to accomplish this from the computational perspective is to consider an additive noise

$$X^{(\ell)} = F(X^{(\ell-1)}) + \epsilon^{(\ell)} = \phi(W^{(\ell)T}X^{(\ell-1)} - B^{(\ell)}) + \epsilon^{(\ell)},$$

with $\mathbb{E}[\epsilon^{(\ell)}] = 0$ and some known positive entropy, $H(\epsilon^{(\ell)})$. In the case of an additive noise, a computation provides

$$H(X^{(\ell)}|X^{(\ell-1)}) = H(F(X^{(\ell-1)}) + \epsilon^{(\ell)}\,|\,X^{(\ell-1)}) = H(\epsilon^{(\ell)}),$$

since the noise is independent of the $(\ell-1)$th layer.
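The identity for additive noise can be verified on a discrete toy model. The sketch below assumes integer-valued activations and noise, an arbitrary deterministic layer map, and hypothetical distributions; for each value of the previous layer, the output is a shifted copy of the noise, so its conditional entropy equals the entropy of the noise.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conditional_entropy_of_output(p_x, layer_map, noise):
    """H(F(X) + eps | X) in nats for discrete X and eps, with eps independent of X.

    p_x:   dict value -> probability for the previous layer
    noise: dict value -> probability for the additive noise eps
    """
    total = 0.0
    for x, px in p_x.items():
        cond = {}                                  # distribution of F(x) + eps given X = x
        for e, pe in noise.items():
            y = layer_map(x) + e
            cond[y] = cond.get(y, 0.0) + pe
        total += px * entropy(cond.values())
    return total

p_x = {0: 0.5, 1: 0.3, 2: 0.2}             # hypothetical previous-layer distribution
noise = {-1: 0.25, 0: 0.5, 1: 0.25}        # hypothetical zero-mean noise
h = conditional_entropy_of_output(p_x, lambda x: x * x, noise)
print(h, entropy(noise.values()))          # the two values coincide
```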

We shall present next some applications of mutual information to independence and information compression in a feedforward neural network with noisy layers.

12.8.3 Independent layers 

The layers $r$ and $\ell$ are called independent if $I(X^{(r)},X^{(\ell)}) = 0$, i.e., if the layer activation $X^{(\ell)}$ does not contain any information from the layer activation $X^{(r)}$. Equivalently, the random variables $X^{(r)}$ and $X^{(\ell)}$ are independent by Proposition 12.7.1, part (b).

Proposition 12.8.3 (Separability by independence) Assume the layers $\ell-1$ and $\ell$ are independent. Then

(a) The layer $\ell$ is independent of any layer $r$, for any $r < \ell-1$.

(b) The layer $\ell-1$ is independent of any layer $k$, for any $k > \ell$.

Proof: The proof is a consequence of the data processing inequalities, Proposition 12.7.9, and the nonnegativity of the mutual information, see Fig. 12.5.

(a) For any $r < \ell-1$, we have

$$0 \le I(X^{(r)},X^{(\ell)}) \le I(X^{(\ell-1)},X^{(\ell)}) = 0,$$

from where $I(X^{(r)},X^{(\ell)}) = 0$.

(b) For any $k > \ell$, we have

$$0 = I(X^{(\ell-1)},X^{(\ell)}) \ge I(X^{(\ell-1)},X^{(k)}) \ge 0,$$

which implies $I(X^{(\ell-1)},X^{(k)}) = 0$. ■

Roughly speaking, if two consecutive layers of a feedforward neural network are independent, then any layer before them is independent from any layer after them.










12.8.4 Compressionless layers 

We consider a feedforward neural network with noisy layers. The $\ell$th layer is called compressionless if

$$I(X^{(0)},X^{(\ell-1)}) = I(X^{(0)},X^{(\ell)}),$$

i.e., the input $X^{(0)}$ conveys the same information about both $X^{(\ell-1)}$ and $X^{(\ell)}$.

Remark 12.8.4 In the absence of noise, both $X^{(\ell-1)}$ and $X^{(\ell)}$ depend deterministically on $X^{(0)}$ and in this case we have

$$
\begin{aligned}
I(X^{(0)},X^{(\ell-1)}) &= H(X^{(\ell-1)}) - H(X^{(\ell-1)}|X^{(0)}) = H(X^{(\ell-1)}) \\
I(X^{(0)},X^{(\ell)}) &= H(X^{(\ell)}) - H(X^{(\ell)}|X^{(0)}) = H(X^{(\ell)}),
\end{aligned}
$$

where we used Proposition 12.6.1. Therefore, the $\ell$th layer is compressionless in the sense of the previous definition if $H(X^{(\ell-1)}) = H(X^{(\ell)})$, namely, no entropy leak between these layers. This relation is implied by the following three conditions:

(i) the activation function satisfies $\phi' > 0$;

(ii) $d^{(\ell-1)} = d^{(\ell)}$;

(iii) the weight matrix $W^{(\ell)}$ is nonsingular.

We note that conditions (i)-(iii) are also necessary conditions for the layer $X^{(\ell)}$ to be uncompressed, see Proposition 11.6.3 and Theorem 11.7.1. These two are distinct concepts, as the former relates to the measure of information and the latter to the associated sigma-algebra. However, both concepts describe the fact that some sort of information invariance occurs between layers $\ell-1$ and $\ell$.

The next result deals with a few equivalent ways to show that a layer is compressionless. It is included here for completeness reasons.


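Proposition 12.8.5 Consider a feedforward neural network with noisy layers. The following are equivalent:

(i) The $(\ell+1)$th layer is compressionless;

(ii) $I(X^{(0)},X^{(\ell)}) = I(X^{(0)},X^{(\ell+1)})$;

(iii) $H(X^{(0)}|X^{(\ell+1)}) = H(X^{(0)}|X^{(\ell)})$;

(iv) $p(x^{(0)}|x^{(\ell+1)}) = p(x^{(0)}|x^{(\ell)},x^{(\ell+1)})$;

(v) $p(x^{(\ell)},x^{(\ell+1)}|x^{(0)}) = p(x^{(\ell+1)}|x^{(0)})\,p(x^{(\ell)}|x^{(\ell+1)})$.

Proof: (i) $\Leftrightarrow$ (ii) It comes from the definition.

(ii) $\Leftrightarrow$ (iii) Writing the mutual information in terms of entropy, we have

$$
\begin{aligned}
I(X^{(0)},X^{(\ell)}) = I(X^{(0)},X^{(\ell+1)}) &\Leftrightarrow H(X^{(0)}) - H(X^{(0)}|X^{(\ell)}) = H(X^{(0)}) - H(X^{(0)}|X^{(\ell+1)}) \\
&\Leftrightarrow H(X^{(0)}|X^{(\ell)}) = H(X^{(0)}|X^{(\ell+1)}).
\end{aligned}
$$

(iii) $\Leftrightarrow$ (iv) Since $X^{(0)} \to X^{(\ell)} \to X^{(\ell+1)}$ is a Markov chain, Lemma 12.7.7, part (b) implies $H(X^{(0)}|X^{(\ell)},X^{(\ell+1)}) = H(X^{(0)}|X^{(\ell)})$. On the other side, Lemma 12.7.6 yields $H(X^{(0)}|X^{(\ell+1)}) \ge H(X^{(0)}|X^{(\ell)},X^{(\ell+1)})$. The last two expressions lead to the inequality

$$H(X^{(0)}|X^{(\ell+1)}) \ge H(X^{(0)}|X^{(\ell)}).$$

According to the proof of Lemma 12.7.6, this inequality becomes identity (as in (iii)) if and only if

$$p(x^{(0)}|x^{(\ell+1)}) = p(x^{(0)}|x^{(\ell)},x^{(\ell+1)}),$$

which is (iv).

(iv) $\Leftrightarrow$ (v) It follows from a computation using conditional probability densities. We have

$$
\begin{aligned}
p(x^{(0)}|x^{(\ell+1)}) &= \frac{p(x^{(0)},x^{(\ell+1)})}{p(x^{(\ell+1)})} = \frac{p(x^{(\ell+1)}|x^{(0)})\,p(x^{(0)})}{p(x^{(\ell+1)})} \\
p(x^{(0)}|x^{(\ell)},x^{(\ell+1)}) &= \frac{p(x^{(0)},x^{(\ell)},x^{(\ell+1)})}{p(x^{(\ell)},x^{(\ell+1)})} = \frac{p(x^{(\ell)},x^{(\ell+1)}|x^{(0)})\,p(x^{(0)})}{p(x^{(\ell)},x^{(\ell+1)})}.
\end{aligned}
$$

Equating $p(x^{(0)}|x^{(\ell+1)}) = p(x^{(0)}|x^{(\ell)},x^{(\ell+1)})$ yields

$$p(x^{(\ell)},x^{(\ell+1)}|x^{(0)}) = p(x^{(\ell+1)}|x^{(0)})\,\frac{p(x^{(\ell)},x^{(\ell+1)})}{p(x^{(\ell+1)})}.$$

Using $\frac{p(x^{(\ell)},x^{(\ell+1)})}{p(x^{(\ell+1)})} = p(x^{(\ell)}|x^{(\ell+1)})$ yields equation (v). ■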


12.8.5 The number of features 

In this section we discuss a useful interpretation of entropy and mutual information for estimating the size of the input and output layers of a feedforward neural network, as well as the number of described features.

Assume we have a discrete random variable, $X$, that takes $n$ values, $x_1, x_2, \dots, x_n$ with probabilities $p_1, p_2, \dots, p_n$, respectively. Assume that the value of $X$ is revealed to us by a person who can communicate only by means of the words “yes” and “no”. Then we are always able to arrive at the correct value of $X$ by a finite sequence of “yes” and “no” answers. Then the entropy $H(X)$ is the average minimum number of “yes” and “no” answers needed to find the correct value of $X$. The content of this result is known in coding theory under the name of the noiseless coding theorem, see, for instance, the reference [11].

The number of values taken by the input data $X$ can be estimated from its entropy as $n \approx 2^{H(X)}$, where $H(X)$ is measured in bits. Now, consider a neural network with input $X$ and output $Y$. Each value, $y_i$, of $Y$ is considered to be a certain feature of the input $X$. The conditional entropy of $X$, given the feature $Y = y_i$, is defined by

$$H(X|Y = y_i) = -\sum_{k=1}^n p(x_k|y_i)\ln p(x_k|y_i),$$

and represents the uncertainty of $X$ given that $X$ contains the feature $y_i$. The size of the input data with feature $y_i$ is approximated by $2^{H(X|Y=y_i)}$. This size depends on the feature, but we can consider their mean. The average size of the input data with a given feature is estimated by $2^{H(X|Y)}$, where

$$H(X|Y) = \sum_{j=1}^m p(y_j)H(X|Y = y_j)$$

is a weighted sum of the conditional entropies. The estimated number of subsets of the input data is obtained dividing the size of $X$ by the average size of a subset of the same feature, i.e.,

$$\frac{2^{H(X)}}{2^{H(X|Y)}} = 2^{H(X)-H(X|Y)} = 2^{I(X,Y)}.$$

Since each subset of $X$ is mapped one-to-one to a feature, then $2^{I(X,Y)}$ estimates the number of features of the output layer. The information $I(X,Y)$ is measured in bits.

For instance, in the case of the MNIST data classification, there are 10 classes of digits, or 10 features. In this case the network should be constructed such that the mutual information between the output and the input satisfies $I(X,Y) = \log_2 10$. However, if $I(X,Y) < \log_2 10$ the network will lead to an underfit, since the output information conveyed by the input is too small to classify 10 digits.
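The estimate $2^{I(X,Y)}$ can be tried on a small case where the number of features is known in advance. In the sketch below the input is a card from a 52-card deck and the output is its suit or its color; this choice of input and features, as well as the helper name `mutual_information_bits`, are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information_bits(pairs):
    """I(X,Y) in bits when the listed (x, y) pairs are equally likely."""
    n = len(pairs)
    cx = Counter(x for x, _ in pairs)
    cy = Counter(y for _, y in pairs)
    cxy = Counter(pairs)
    # p(x,y) / (p(x) p(y)) = (c/n) / ((cx/n) (cy/n)) = c * n / (cx * cy)
    return sum((c / n) * math.log2(c * n / (cx[x] * cy[y])) for (x, y), c in cxy.items())

suits = ["Hearts", "Diamonds", "Spades", "Clubs"]
cards = [(rank, suit) for suit in suits for rank in range(1, 14)]
card_to_suit = [(card, card[1]) for card in cards]
card_to_color = [(card, "Red" if card[1] in ("Hearts", "Diamonds") else "Black")
                 for card in cards]

n_suit_features = 2 ** mutual_information_bits(card_to_suit)      # estimate for suits
n_color_features = 2 ** mutual_information_bits(card_to_color)    # estimate for colors
print(n_suit_features, n_color_features)
print(math.log2(10))   # bits needed to distinguish 10 classes
```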

We shall verify these formulas on the following concrete example. 

Example 12.8.6 We shall consider the example of selecting a random card 
from an ordinary 52-card deck. Then each card has the selection probability 





equal to $p_i = 1/52$. A first question, Q1, can be: Is the suit color red?
Assume the answer is “yes”. Then the next question, Q2, can be: Is the
suit Diamonds? Assume the answer is “no”. This means the correct suit is
Hearts. Since the suit has 13 cards, the next question, Q3, can be: Is the
card number larger than 6? Let the answer be “no”. This leaves us with only
six choices: 1, 2, 3, 4, 5, and 6. We divide them into two sets and decide
again which set the card belongs to by asking Q4: Is the card number less
than 4? Assume the answer is “yes”. This implies the card is either 1, 2,
or 3. The next question, Q5, can be: Is the card number less than 2? If the
answer is “yes”, the card is the Ace of Hearts, and in this case 5 questions
suffice to determine the card identity. However, if the last answer is “no”,
then the card is either a 2 or a 3. We determine it by asking Q6: Is the
card number 2? A “yes” answer implies that the card is the 2 of Hearts,
while a “no” answer implies the card is the 3 of Hearts. In conclusion, we
need either 5 or 6 questions to determine the card identity exactly. If the
experiment of finding the card is performed many times and we consider the
average number of questions needed, we obtain 5.7. This means that the
knowledge of the identity of a randomly selected card from a 52-card deck
contains 5.7 bits of information.

On the other hand, the entropy of the random variable describing the
random extraction of a card is

$$H(X) = -\sum_{i=1}^{52} p_i \ln p_i = -\ln(1/52) \approx 3.95.$$

Dividing by $\ln 2$ we obtain the information in bits as
$3.95/\ln 2 \approx 5.7$, which agrees with the aforementioned value. (If
we do not divide by $\ln 2$, the information is measured in nats rather
than bits.)

Consider now the random variable Y, which represents the suit of a card
randomly selected from the same deck. The variable Y takes only four values:
Spades, Hearts, Diamonds, and Clubs. The conditional entropy given one of
the features is

$$H(X|Y = \text{Hearts}) = -\sum_{i=1}^{13} \frac{1}{13} \ln \frac{1}{13} = \ln 13 \approx 2.56.$$

Considering the weighted sum with $p(y_i) = 1/4$, we obtain the conditional
entropy of X, given the feature variable Y, as

$$H(X|Y) = \sum_{i=1}^{4} \frac{1}{4} H(X|Y = y_i) = \ln 13 \approx 2.56.$$


Then the mutual information of X and Y can be estimated as

$$I(X, Y) = H(X) - H(X|Y) \approx 3.95 - 2.56 = 1.39,$$

which, after dividing by $\ln 2$, gives 2 bits. Then $2^{I(X,Y)} = 4$,
which corresponds to 4 classes, one for each suit.
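The numbers in this example can be checked with a few lines of Python. The
following is only a minimal sketch; the helper name `entropy_nats` is
introduced here for illustration and does not come from the text.

```python
import math

def entropy_nats(p):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(q * math.log(q) for q in p if q > 0)

# X: a card drawn uniformly from a 52-card deck.
h_x = entropy_nats([1 / 52] * 52)              # H(X) = ln 52

# Given the suit, the card is uniform over 13 values.
h_x_given_suit = entropy_nats([1 / 13] * 13)   # H(X | Y = y_i) = ln 13

# Weighted sum over the four equally likely suits.
h_x_given_y = sum(0.25 * h_x_given_suit for _ in range(4))

mutual_info_bits = (h_x - h_x_given_y) / math.log(2)
num_features = 2 ** mutual_info_bits

print(h_x, h_x / math.log(2))        # entropy in nats and in bits
print(h_x_given_y, mutual_info_bits, num_features)
```

The mutual information comes out as $\ln 4 / \ln 2 = 2$ bits, so the
estimated number of features is 4, in agreement with the four suits.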


12.8.6 Total Compression

The compression factor of the $\ell$th layer of a feedforward neural network
with noisy layers is defined by the following ratio of mutual informations:

$$\rho_\ell = \frac{I(X^{(0)}, X^{(\ell)})}{I(X^{(0)}, X^{(\ell-1)})}.$$

From the data processing inequality, Proposition 12.7.9, part (a), we have
$I(X^{(0)}, X^{(\ell-1)}) \geq I(X^{(0)}, X^{(\ell)})$. This fact implies
$0 \leq \rho_\ell \leq 1$. The compression factor is 1 if the layer is
compressionless and is equal to 0 if $X^{(0)}$ and $X^{(\ell)}$ are
independent layers.

The product of all compression factors is independent of the hidden layers
of the network, as the following computation shows:

$$\rho_1 \rho_2 \cdots \rho_L
= \frac{I(X^{(0)}, X^{(1)})}{I(X^{(0)}, X^{(0)})} \cdot
  \frac{I(X^{(0)}, X^{(2)})}{I(X^{(0)}, X^{(1)})} \cdots
  \frac{I(X^{(0)}, X^{(L)})}{I(X^{(0)}, X^{(L-1)})}
= \frac{I(X^{(0)}, X^{(L)})}{I(X^{(0)}, X^{(0)})}
= \frac{I(X^{(0)}, X^{(L)})}{H(X^{(0)})},$$

where we used $I(X^{(0)}, X^{(0)}) = H(X^{(0)}) - H(X^{(0)}|X^{(0)}) = H(X^{(0)})$.
This suggests that the quotient

$$\rho = \frac{I(X^{(0)}, X^{(L)})}{H(X^{(0)})} \tag{12.8.11}$$

describes the total compression in a feedforward neural network. It represents
the amount of information shared by the input and the output of the network,
scaled by the amount of input information.


Remark 12.8.7 We make two remarks, which follow easily.

(i) The total compression $\rho = 1$ (no compression) if and only if all
layers are compressionless.

(ii) The total compression $\rho = 0$ if and only if there is a layer
independent of the input $X^{(0)}$.












We note that if the feedforward neural network has to classify the input
data into c classes, then the last layer has to have the size $d^{(L)} = c$.
If the input layer has the size $n = d^{(0)}$, then we shall estimate the
total compression in terms of these two parameters.

Proposition 12.8.8 The total compression of a feedforward neural network
with n inputs that classifies data into c classes is given by
$\rho = \log_n c$.

Proof: Since the number of classes is $c = 2^{I(X^{(0)}, X^{(L)})}$ and the
input size is $n = 2^{H(X^{(0)})}$, using relation (12.8.11) and the change
of base formula for the logarithm, we obtain

$$\rho = \frac{I(X^{(0)}, X^{(L)})}{H(X^{(0)})} = \frac{\log_2 c}{\log_2 n} = \log_n c.$$


For instance, in the case of the MNIST database classification of digits,
the number of classes is c = 10 and the input size is $n = 28 \times 28$
(each picture is a matrix of $28 \times 28$ pixels), so the total compression
needed for classification is $\rho = \log_{784} 10 \approx 0.345$.
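The change of base formula used in Proposition 12.8.8 is easy to evaluate
numerically. The short sketch below uses a hypothetical helper,
`total_compression`, which is not part of the text.

```python
import math

def total_compression(n_inputs, n_classes):
    """rho = log_n(c), computed with the change of base formula."""
    return math.log(n_classes) / math.log(n_inputs)

# MNIST: 28 x 28 = 784 inputs, 10 classes.
rho_mnist = total_compression(28 * 28, 10)
print(rho_mnist)
```

The printed value is slightly above 0.345, consistent with the figure quoted
above.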

Remark 12.8.9 We note that, when there is no noise in the layers, the total
compression is given by the quotient of the output and the input entropies

$$\rho = \frac{H(X^{(L)})}{H(X^{(0)})}.$$

12.9 Network Capacity 

We have seen that a feedforward neural network can be interpreted as an 
information compressor. Now, this section addresses the following question: 

How large is the information conveyed by the input pattern given you 
observe the network response? 

In order to answer this question we need to define the concept of network
capacity. This can be informally defined by stating that capacity is the
ability of a neural network to fit a large variety of target data. A low
capacity network struggles to fit the training data and may lead to an
underfit, while a high capacity network memorizes the training set, leading
to an overfit. The capacity depends on the network architecture, more
specifically on the number of neurons, the number of weights and biases, the
learning algorithm adopted, etc.

In the following we shall further formalize this concept and also find an
exact formula for the capacity in the case of feedforward neural networks
with noisy layers given by discrete random variables.







12.9.1 Types of capacity 

Consider a feedforward neural network having $L - 1$ hidden layers and
$d^{(\ell)}$ neurons in the $\ell$th layer, $0 \leq \ell \leq L$. Denote the
input variable by $X = X^{(0)}$, the output by $Y = X^{(L)}$, and the target
variable by Z. If X is a random variable, then the output, Y, is also a
random variable.

There are three mutual informations of interest, $I(X, Y)$, $I(Y, Z)$, and
$I(X, Z)$, which we shall address shortly.

1. The first one, $I(X, Y)$, represents the amount of information contained
in X about Y, or, equivalently, the amount of information processed by the
network. This depends on the input distribution $p(x)$, as well as on the
system of weights, $W^{(\ell)}$, and biases, $B^{(\ell)}$. Then, varying the
distribution $p(x)$ while keeping the matrices $W^{(\ell)}$ and $B^{(\ell)}$
fixed until the information processed reaches a maximum, we obtain the
network capacity corresponding to the weight system $(W, B)$ as

$$C(W, B) = \max_{p(x)} I(X, Y). \tag{12.9.12}$$

This represents the maximum amount of information a network can process
if the weights and biases are kept fixed. It is worth noting the similarity
to the definition of channel capacity, which appears in the famous Channel
Coding Theorem; see Chapter 8 of [131].

Varying the system of weights and biases we obtain the maximum possible
information processed by the network, called the total network capacity

$$\max_{W, B} \{ C(W, B);\ \|W\|^2 + \|B\|^2 \leq 1 \}. \tag{12.9.13}$$

The regularization constraint $\|W\|^2 + \|B\|^2 \leq 1$ has been added in
order to keep the variables $(W, B)$ in a compact set and to ensure the
existence of the maximum.

The maximum information processed by a network with a given bounded
input entropy is measured by the essential capacity

$$C(M) = \max_{H(X) \leq M} I(X, Y). \tag{12.9.14}$$


2. The mutual information $I(Y, Z)$ represents the amount of information
conveyed by the network output, Y, about the target variable Z. During the
learning process we expect that $I(Y, Z)$ will tend to the target entropy,
$H(Z)$. Assuming the random variables have discrete values, we have
$H(Z|Y) \geq 0$. Then the inequality

$$I(Y, Z) = H(Z) - H(Z|Y) \leq H(Z)$$

shows that the mutual information $I(Y, Z)$ is bounded from above by $H(Z)$,
and hence for a better learning of Z the mutual information $I(Y, Z)$ has to
be as large as possible.

3. The third quantity of interest, $I(X, Z)$, represents the mutual
information contained in the training pair $(X, Z)$. This can be computed
explicitly since the training distribution $p_{XZ}$ is given.

Before proving the existence of capacity, we shall first introduce some
terminology.

12.9.2 The input distribution 

The input layer has $d^{(0)} = n$ neurons and the input random variable is
given by $X = (X_1, \ldots, X_n)$. Each component $X_k$ is a real-valued
random variable, which takes values in the finite set
$\{x_k^1, x_k^2, \ldots, x_k^{r_k}\}$. The input probability is given by

$$P(X = x) = P(X_1 = x_1^{i_1}, \ldots, X_n = x_n^{i_n}) = p(x_1^{i_1}, \ldots, x_n^{i_n})$$

for $x = (x_1^{i_1}, \ldots, x_n^{i_n})$.

In the particular case when there is only one neuron in the input layer,
namely $d^{(0)} = 1$ and $X = X_1$, we assume that X takes N values,
$\{x_1, \ldots, x_N\}$, and each value is taken with a given probability

$$p(x_i) = P(X = x_i), \quad i = 1, \ldots, N,$$

which forms the input probability distribution; see Fig. 12.6.

12.9.3 The output distribution 

The output layer has $d^{(L)} = m$ neurons and the output random variable is
given by $Y = (Y_1, \ldots, Y_m)$, with each component a real-valued random
variable, taking values in the finite set $\{y_k^1, \ldots, y_k^{t_k}\}$.
The output probability is given by

$$P(Y = y) = P(Y_1 = y_1^{j_1}, \ldots, Y_m = y_m^{j_m}) = p(y_1^{j_1}, \ldots, y_m^{j_m}),$$

for $y = (y_1^{j_1}, \ldots, y_m^{j_m})$.

In the particular case when the output layer has only one neuron,
$Y = Y_1$ takes values in $\{y_1, \ldots, y_M\}$ with probabilities

$$p(y_j) = P(Y = y_j), \quad j = 1, \ldots, M,$$

which forms the output distribution; see Fig. 12.6.


Figure 12.6: The output density $p(y_j)$ in terms of the input density $p(x_i)$.


12.9.4 The input-output tensor 

We consider the multi-indices $I = (i_1, \ldots, i_n)$ and
$J = (j_1, \ldots, j_m)$ and let

$$x^I = (x_1^{i_1}, \ldots, x_n^{i_n}), \qquad y^J = (y_1^{j_1}, \ldots, y_m^{j_m}).$$

The input-output tensor $q_{IJ} = q_{i_1, \ldots, i_n;\, j_1, \ldots, j_m}$ is
defined by the following conditional probability:

$$q_{IJ} = P(Y = y^J | X = x^I) = p(y_1^{j_1}, \ldots, y_m^{j_m} \,|\, x_1^{i_1}, \ldots, x_n^{i_n}).$$

The tensor $q_{IJ}$ depends on the architectural structure of the network
(with fixed weights and biases) and is independent of the input distribution
$p(x)$. Since

$$P(Y = y^J) = \sum_{I} P(Y = y^J | X = x^I) P(X = x^I),$$

the tensor $q_{IJ}$ transforms the input distribution into the output
distribution by the formula $p(y^J) = \sum_{I} q_{IJ}\, p(x^I)$, or in the
equivalent detailed form

$$p(y_1^{j_1}, \ldots, y_m^{j_m}) = \sum_{i_1, \ldots, i_n} q_{i_1, \ldots, i_n;\, j_1, \ldots, j_m}\, p(x_1^{i_1}, \ldots, x_n^{i_n}).$$

12.9.5 The input-output matrix 

If $d^{(0)} = d^{(L)} = 1$, then the neural network is characterized by the
following input-output matrix $q_{ij} = p(y_j|x_i)$, where

$$p(y_j|x_i) = P(Y = y_j | X = x_i), \quad 1 \leq i \leq N, \ 1 \leq j \leq M.$$





The role of the matrix $Q = (q_{ij})$ is to transform the input distribution
into the output distribution by the formula
$p(y_j) = \sum_{i=1}^{N} p(x_i) p(y_j|x_i)$, or in an equivalent matrix form,
$p(y) = Q^T p(x)$. If $N = M$ and the matrix Q is nonsingular, given the
output density $p(y)$, there is a unique input density $p(x)$ that is
transformed into $p(y)$.

However, if $M < N$, then Q is not a square matrix, and it does not make
sense to consider its determinant. In this case it is useful to assume
the maximal rank condition, $\operatorname{rank}(Q) = \operatorname{rank}(Q^T) = M$.
Under this condition, there is at most one solution $p(x)$ for the
aforementioned equation; see Exercise 12.13.4. The existence of the solution
will be treated in the next section.

Since for a fixed value $X = x_i$ the random variable Y takes some value
$y_j$, we have $\sum_{j=1}^{M} p(y_j|x_i) = 1$. This can also be written as
$\sum_{j=1}^{M} q_{ij} = 1$, i.e., the sum of the entries on each row of the
input-output matrix Q is equal to 1. Since the entries of Q are nonnegative,
$q_{ij} \geq 0$, it follows that Q is a Toeplitz matrix.¹ This property of
the matrix $q_{ij}$ will be used several times in the next section.
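As an illustration of the formula $p(y) = Q^T p(x)$ and of the unit row sums,
consider the following NumPy sketch. The matrix `Q` and the distribution
`p_x` are made-up values chosen only for the example.

```python
import numpy as np

# Hypothetical channel with N = 3 input values and M = 2 output values.
# Row i holds the conditional probabilities p(y_j | x_i).
Q = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])

# Hypothetical input distribution p(x_i).
p_x = np.array([0.5, 0.3, 0.2])

row_sums = Q.sum(axis=1)   # every row of Q sums to 1
p_y = Q.T @ p_x            # output distribution p(y) = Q^T p(x)

print(row_sums)
print(p_y, p_y.sum())
```

Because each row of Q sums to 1, the resulting vector `p_y` is again a
probability distribution.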

Remark 12.9.1 Consider a feedforward neural network with noiseless layers,
such that the input-output mapping, f, is bijective, with $f(x_i) = y_i$ and
$M = N$. The input-output matrix in this case is given by

$$q_{ij} = P(Y = y_j | X = x_i) = P(f(X) = y_j | X = x_i) = P(f(x_i) = y_j) = \delta_{ij},$$

since the event $\{f(x_i) = y_j\}$ is either sure or impossible. Hence, the
input-output matrix is the identity, $Q = I_N$.

If the injectivity of f is dropped, it is not hard to show that the entries
of the matrix Q consist only of 0s and 1s, with only one 1 on each row (the
rows are “one-hot vectors”). This fact agrees with the Toeplitz property of Q.

¹ Sometimes it is also called a stochastic matrix.

12.9.6 The existence of network capacity

The fact that the definition of the network capacity makes sense reduces
to the existence of the maximum of the mutual information $I(X, Y)$ under
variations of the input distribution. For the sake of simplicity we shall
treat the problem in the particular case $d^{(0)} = d^{(L)} = 1$, i.e., when
there is only one input neuron and one output neuron. We switch the indices
to $n = N$ and $m = M$, and determine a formula for $I(X, Y)$ in terms of
the input distribution $p(x_i)$, using the definitions of the mutual
information, entropy, and conditional entropy:


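$$\begin{aligned}
I(X, Y) &= H(Y) - H(Y|X) \\
&= -\sum_{j=1}^{m} p(y_j) \ln p(y_j) + \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \ln p(y_j|x_i) \\
&= -\sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i) p(y_j|x_i) \ln p(y_j) + \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i) p(y_j|x_i) \ln p(y_j|x_i) \\
&= \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i) p(y_j|x_i) \big( \ln p(y_j|x_i) - \ln p(y_j) \big) \\
&= \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i) p(y_j|x_i) \Big( \ln p(y_j|x_i) - \ln \sum_{k=1}^{n} p(x_k) p(y_j|x_k) \Big) \\
&= \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i) q_{ij} \Big( \ln q_{ij} - \ln \sum_{k=1}^{n} p(x_k) q_{kj} \Big).
\end{aligned}$$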


It follows that for a given input-output matrix, $(q_{ij})$, the mutual
information $I(X, Y)$ is a continuous function of n real numbers,
$p(x_1), \ldots, p(x_n)$, which belong to the domain

$$K = \Big\{ (p_1, \ldots, p_n);\ p_i \geq 0,\ \sum_{i=1}^{n} p_i = 1 \Big\}.$$

Since $0 \leq p_i \leq 1$, then $K \subset [0, 1]^n$, i.e., K is bounded. The
set K is also closed, as an intersection of two closed sets,
$K = S_n \cap \mathcal{H}_n$, the hyperplane

$$\mathcal{H}_n = \Big\{ (p_1, \ldots, p_n);\ \sum_{i=1}^{n} p_i = 1 \Big\}$$

and the first sector

$$S_n = \{ (p_1, \ldots, p_n);\ p_i \geq 0 \}.$$

Being bounded and closed, K is a compact set in $\mathbb{R}^n$. Since any
continuous function on a compact set reaches its maximum on that set, and
$I(X, Y)$ is a continuous function of $p_i$ on K, it follows that there is an
input distribution $p_i^* = p(x_i)$ for which $I(X, Y)$ is maximum. The
maximizing distribution $p^*$ can belong either to the boundary of K, in
which case at least one of the components is zero, or to the interior of K.
In the latter case, calculus methods will be employed to characterize the
maximum capacity.
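The existence argument can be made concrete by evaluating $I(X, Y)$ on a grid
of input distributions. The sketch below assumes a hypothetical noisy binary
channel (flip probability 0.1); the function name `mutual_information` and
the numerical values are illustrative choices, not taken from the text.

```python
import numpy as np

def mutual_information(p_x, Q):
    """I(X,Y) in nats for input distribution p_x and Q[i, j] = p(y_j | x_i)."""
    p_x = np.asarray(p_x, dtype=float)
    joint = p_x[:, None] * Q                     # p(x_i, y_j)
    p_y = joint.sum(axis=0)                      # p(y_j)
    mask = joint > 0                             # treat 0 * ln 0 as 0
    ratio = Q[mask] / np.broadcast_to(p_y, Q.shape)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))

# Hypothetical binary channel with flip probability 0.1.
Q = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Scan the compact set K = {(a, 1 - a); 0 <= a <= 1}.
grid = np.linspace(0.0, 1.0, 1001)
values = [mutual_information([a, 1.0 - a], Q) for a in grid]
best = int(np.argmax(values))
print(grid[best], values[best])
```

For this symmetric channel the maximum is attained in the interior of K, at
the uniform input distribution, and the end points of the grid (the boundary
of K) give zero mutual information.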





12.9.7 The Lagrange multiplier method 

In this section we shall find the network capacity using a variational problem
involving a Lagrange multiplier, $\lambda$, which is used to introduce the
linear constraint $\sum_{i=1}^{n} p_i = 1$. The function to be maximized,
$F : K \to \mathbb{R}$, is given by

$$F(p_1, \ldots, p_n) = I(X, Y) + \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big).$$

The function $F(p_1, \ldots, p_n)$ is concave, as a sum of a concave function,
$I(X, Y)$, and the linear constraint function in p. Therefore, any critical
point $p^*$ of F which belongs to the interior of the set K, i.e., a point
where $\nabla_p F(p^*) = 0$, is a point where F reaches a relative maximum.
Furthermore, if this point is unique, then it corresponds to a global
maximum. We shall compute next the variational equations

$$\frac{\partial F}{\partial p_k} = 0, \quad 1 \leq k \leq n.$$

Using that


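$$\frac{\partial p(y_j)}{\partial p_k} = \frac{\partial p(y_j)}{\partial p(x_k)} = p(y_j|x_k) = q_{kj},$$

$$\frac{\partial H(Y)}{\partial p(y_j)} = -\frac{\partial}{\partial p(y_j)} \sum_{r=1}^{m} p(y_r) \ln p(y_r) = -(1 + \ln p(y_j)),$$

the chain rule implies

$$\frac{\partial H(Y)}{\partial p_k}
= \sum_{j=1}^{m} \frac{\partial H}{\partial p(y_j)} \frac{\partial p(y_j)}{\partial p_k}
= -\sum_{j=1}^{m} (1 + \ln p(y_j))\, q_{kj}
= -1 - \sum_{j=1}^{m} \ln p(y_j)\, q_{kj},$$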


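where we used the Toeplitz property of $q_{kj}$. We also have

$$\frac{\partial H(Y|X)}{\partial p_k}
= -\frac{\partial}{\partial p_k} \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i) p(y_j|x_i) \ln p(y_j|x_i)
= -\frac{\partial}{\partial p_k} \sum_{i=1}^{n} \sum_{j=1}^{m} p_i q_{ij} \ln q_{ij}
= -\sum_{j=1}^{m} q_{kj} \ln q_{kj}.$$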




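Assembling the parts, we have

$$\begin{aligned}
\frac{\partial F}{\partial p_k}
&= \frac{\partial}{\partial p_k} \Big[ I(X, Y) + \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big) \Big] \\
&= \frac{\partial H(Y)}{\partial p_k} - \frac{\partial H(Y|X)}{\partial p_k} + \lambda \\
&= -1 - \sum_{j=1}^{m} \ln p(y_j)\, q_{kj} + \sum_{j=1}^{m} q_{kj} \ln q_{kj} + \lambda.
\end{aligned}$$

Hence, the equations $\frac{\partial F}{\partial p_k} = 0$ take the explicit form

$$1 - \lambda + \sum_{j=1}^{m} q_{kj} \ln p(y_j) = \sum_{j=1}^{m} q_{kj} \ln q_{kj}, \quad 1 \leq k \leq n. \tag{12.9.15}$$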


These are n equations, which together with the constraint
$\sum_{j=1}^{m} p(y_j) = 1$ form a system of n + 1 equations with m + 1
unknowns, $p(y_1), \ldots, p(y_m)$ and $\lambda$. The right side of (12.9.15)
is known, since the input-output matrix $q_{ij}$ is given. We need to solve
for the m + 1 unknowns, $p(y_1), \ldots, p(y_m)$ and $\lambda$, which appear
on the left side. Using the Toeplitz property of $q_{kj}$, the previous
relation can be written as

$$\sum_{j=1}^{m} q_{kj} \big( 1 - \lambda + \ln p(y_j) \big) = \sum_{j=1}^{m} q_{kj} \ln q_{kj}, \quad 1 \leq k \leq n.$$

We shall write this system of equations in matrix form. First, we introduce
some notation:

$$\ln p(y)^T = (\ln p(y_1), \ldots, \ln p(y_m))$$

and $h^T = (h_1, \ldots, h_n)$, where $h_k = \sum_{j=1}^{m} q_{kj} \ln q_{kj}$.
Then the previous system of equations becomes

$$Q(1 - \lambda + \ln p(y)) = h. \tag{12.9.16}$$

The input-output matrix $Q = (q_{ij})$ is of type $n \times m$, with
$m \leq n$. The solution method uses the idea of the Moore-Penrose
pseudoinverse. We multiply the previous equation on the left by $Q^T$ and
obtain

$$Q^T Q (1 - \lambda + \ln p(y)) = Q^T h.$$

We notice that $Q^T Q$ is a square matrix of type $m \times m$. From the
double inequality

$$m \leq \operatorname{rank}(Q^T Q) \leq \operatorname{rank}(Q) = m,$$

it follows that the matrix $Q^T Q$ is nonsingular. Therefore

$$1 - \lambda + \ln p(y) = (Q^T Q)^{-1} Q^T h.$$

Using the Moore-Penrose pseudoinverse, $Q^+ = (Q^T Q)^{-1} Q^T$, see
section G.2 in the Appendix, we write

$$1 - \lambda + \ln p(y) = Q^+ h.$$

Equating on components, we obtain

$$1 - \lambda + \ln p(y_j) = (Q^+ h)_j, \quad 1 \leq j \leq m,$$

with

$$(Q^+ h)_j = \sum_{k=1}^{n} q^+_{jk} h_k,$$

where $Q^+ = (q^+_{jk})$ is the Moore-Penrose pseudoinverse of Q. Taking an
exponential yields

$$e^{1 - \lambda} p(y_j) = e^{(Q^+ h)_j}, \quad 1 \leq j \leq m. \tag{12.9.17}$$

Summing over j, then using $\sum_{j=1}^{m} p(y_j) = 1$, and taking the log
function, we solve for $\lambda$ as

$$1 - \lambda = \ln \Big( \sum_{j=1}^{m} e^{(Q^+ h)_j} \Big). \tag{12.9.18}$$

This formula produces $\lambda$ in terms of the input-output matrix.
Substituting (12.9.18) back into (12.9.17) and solving for the output
distribution yields

$$p(y_j) = \frac{e^{(Q^+ h)_j}}{\sum_{r=1}^{m} e^{(Q^+ h)_r}}.$$

Using the definition of the softmax function, the output distribution can be
written as

$$p(y) = \operatorname{softmax}(Q^+ h), \tag{12.9.19}$$

where the right side of the previous expression depends only on the
input-output matrix.

Our final goal was to find the input distribution, $p(x)$, which satisfies

$$Q^T p(x) = p(y). \tag{12.9.20}$$

If this equation has a solution $p^* = p(x)$, then by Exercise 12.13.4, it is
unique. Furthermore, if $p_i^* > 0$ for all $1 \leq i \leq n$, then this
solution is in the interior of the definition domain and hence it is the
point where the functional F achieves its maximum.





12.9.8 Finding the Capacity 

Assume we have succeeded in finding a maximum point $p^*$ for $F(p)$. Then we
can compute the network capacity by substituting the solution $p^*$ into the
formula of the mutual information $I(X, Y)$. We start by multiplying formula
(12.9.15) by $p_k$,

$$(1 - \lambda) p_k = \sum_{j=1}^{m} p_k q_{kj} \ln q_{kj} - \sum_{j=1}^{m} p_k q_{kj} \ln p(y_j),$$

and then sum over k to obtain

$$\begin{aligned}
1 - \lambda &= \sum_{k=1}^{n} \sum_{j=1}^{m} p_k q_{kj} \ln q_{kj} - \sum_{k=1}^{n} \sum_{j=1}^{m} p_k q_{kj} \ln p(y_j) \\
&= \sum_{k=1}^{n} \sum_{j=1}^{m} p(x_k, y_j) \ln p(y_j|x_k) - \sum_{j=1}^{m} p(y_j) \ln p(y_j) \\
&= -H(Y|X) + H(Y) = I(X, Y).
\end{aligned}$$

Hence, using (12.9.18), the capacity of a network with
$d^{(0)} = d^{(L)} = 1$ and fixed weights W and biases b is given by

$$C(W, b) = 1 - \lambda = \ln \Big( \sum_{j=1}^{m} e^{(Q^+ h)_j} \Big), \tag{12.9.21}$$

where $Q^+$ is the Moore-Penrose pseudoinverse of the input-output matrix Q
and $h_j = \sum_{r=1}^{m} q_{jr} \ln q_{jr}$. It is worth noting that the
capacity depends only on the input-output matrix $Q = (q_{ij})$.
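Formulas (12.9.19)-(12.9.21) translate directly into a few NumPy calls. The
sketch below is only an illustration under stated assumptions: Q has full
column rank, the resulting input distribution lies in the interior of the
domain, and the helper name `capacity_from_matrix` as well as the example
channel are hypothetical. For non-square Q the least-squares solve returns a
minimum-norm solution, which is not guaranteed to be a probability vector.

```python
import numpy as np

def capacity_from_matrix(Q):
    """Evaluate (12.9.19)-(12.9.21) for Q[k, j] = p(y_j | x_k)."""
    Q = np.asarray(Q, dtype=float)
    safe = np.where(Q > 0, Q, 1.0)               # ln 1 = 0 covers the 0 * ln 0 terms
    h = np.sum(Q * np.log(safe), axis=1)         # h_k = sum_j q_kj ln q_kj
    v = np.linalg.pinv(Q) @ h                    # components (Q^+ h)_j
    shift = v.max()                              # shift for numerical stability
    log_sum = shift + np.log(np.sum(np.exp(v - shift)))   # ln sum_j e^{(Q^+ h)_j}
    p_y = np.exp(v - log_sum)                    # softmax(Q^+ h)
    p_x = np.linalg.lstsq(Q.T, p_y, rcond=None)[0]        # solve Q^T p(x) = p(y)
    return log_sum, p_y, p_x

# Hypothetical binary channel with flip probability 0.1.
Q = np.array([[0.9, 0.1],
              [0.1, 0.9]])
capacity, p_y, p_x = capacity_from_matrix(Q)
print(capacity / np.log(2), p_y, p_x)   # capacity in bits, output and input laws
```

For this channel the formula reproduces the classical value
$\ln 2 + 0.9 \ln 0.9 + 0.1 \ln 0.1$ nats, attained for uniform input and
output distributions.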

12.9.9 Perceptron Capacity 

We have seen that a perceptron, as a computing unit with unit step activation
function, is able to perform half-plane classifications. This section answers
the question: How much information can a perceptron process?

We shall show that the capacity of a single perceptron is 1 bit. This means
that the outcome of a perceptron carries 1 bit of information, and hence it
can learn the decision function of whether a point belongs to a given
half-plane, which also carries 1 bit of information. In order to do this, it
suffices to compute formula (12.9.21) explicitly in the case of a perceptron.

The perceptron input is given by a random variable, X, which takes only
two values, $x_1 = 0$ and $x_2 = 1$, with probabilities $\alpha$ and
$1 - \alpha$, respectively. The output variable, given by $Y = H(wx + b)$,
also has two outcomes, $y_1 = 0$ and $y_2 = 1$, taken with probabilities
$\beta$ and $1 - \beta$. The weight, w, and bias, b, are fixed constants, with
$w \neq 0$. If $q_{ij} = P(Y = y_j | X = x_i)$, $i, j \in \{1, 2\}$, denotes
the input-output matrix, the relation (12.9.20) between the input and output
distributions can be written as:

$$\begin{pmatrix} \beta \\ 1 - \beta \end{pmatrix}
= \begin{pmatrix} q_{11} & q_{21} \\ q_{12} & q_{22} \end{pmatrix}
\begin{pmatrix} \alpha \\ 1 - \alpha \end{pmatrix}. \tag{12.9.22}$$


We shall compute next the entries $q_{ij}$. We have

$$q_{11} = P(Y = y_1 | X = x_1) = P(Y = 0 | X = 0) = P(H(b) = 0)
= \begin{cases} 1, & \text{if } b < 0 \\ 0, & \text{if } b \geq 0 \end{cases}
= 1_{\{b < 0\}},$$

because the event $\{H(b) = 0\}$ occurs surely for $b < 0$ and becomes
impossible for $b \geq 0$. Using the same line of thinking, we compute the
other entries as

$$\begin{aligned}
q_{21} &= P(Y = y_1 | X = x_2) = P(Y = 0 | X = 1) = P(H(w + b) = 0) = 1_{\{w + b < 0\}}, \\
q_{12} &= P(Y = y_2 | X = x_1) = P(Y = 1 | X = 0) = P(H(b) = 1) = 1_{\{b \geq 0\}}, \\
q_{22} &= P(Y = y_2 | X = x_2) = P(Y = 1 | X = 1) = P(H(w + b) = 1) = 1_{\{w + b \geq 0\}}.
\end{aligned}$$

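Therefore, the transpose of the input-output matrix is given by

$$Q^T = \begin{pmatrix} q_{11} & q_{21} \\ q_{12} & q_{22} \end{pmatrix}
= \begin{pmatrix} 1_{\{b < 0\}} & 1_{\{w + b < 0\}} \\ 1_{\{b \geq 0\}} & 1_{\{w + b \geq 0\}} \end{pmatrix}. \tag{12.9.23}$$

We shall choose the weight, w, and bias, b, such that Q becomes nonsingular.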


Proposition 12.9.2 Let $w \neq 0$. If either $w > 0$, $b \in [-w, 0)$, or
$w < 0$, $b \in [0, -w)$, then $\det Q \neq 0$. Furthermore,
$Q = Q^{-1} = Q^+ = Q^T = (Q^T)^{-1}$.

Proof: The entries of the matrix $Q^T$ are either 0 or 1. Since the sum of
the entries on each column of $Q^T$ is equal to 1 (by the Toeplitz property),
the only two cases in which $Q^T$ is not singular are the following:

$$Q^T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad \text{and} \quad
Q^T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

Using (12.9.23), the first case corresponds to $b < 0$, $w + b \geq 0$, and
the second to $b \geq 0$, $w + b < 0$, which are the hypothesis conditions.
We note that the previous matrices are symmetric and are their own inverses.


Since the entries of $Q^T$ are either 0 or 1, we have

$$\begin{pmatrix} h_1 \\ h_2 \end{pmatrix}
= \begin{pmatrix} q_{11} \ln q_{11} + q_{12} \ln q_{12} \\ q_{21} \ln q_{21} + q_{22} \ln q_{22} \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

where we used $\lim_{x \to 0^+} x \ln x = 0$. Substituting in (12.9.21) we
obtain the perceptron’s capacity

$$C(W, b) = 1 - \lambda = \ln \Big( \sum_{j=1}^{2} e^{(Q^+ h)_j} \Big) = \ln 2.$$

Dividing by $\ln 2$ we change the measure units from nats to bits. Hence, the
capacity is 1 bit.

Figure 12.7: Y is a compressed version of the input X, which preserves
meaningful information about the target Z. Y preserves the important features
of the face given by X, which are meaningful for the idea of face required by
the target Z.

We shall find next the input and output probability distributions. Since
$h = 0$ and Q is invertible, formula (12.9.16) yields
$Q(\ln 2 + \ln p(y)) = 0$, i.e., $p(y_j) = 1/2$, or equivalently,
$\beta = 1/2$. Solving for $\alpha$ from (12.9.22) gives $\alpha = 1/2$.
Hence, both input and output probabilities are uniform when the maximum
capacity is reached.
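The perceptron computation can be reproduced numerically. In the sketch
below the unit step is taken with $H(0) = 1$, in line with the indicator
functions used above; the function names and the parameter values
$w = 1$, $b = -0.5$ are illustrative assumptions.

```python
import numpy as np

def heaviside(t):
    """Unit step with H(0) = 1, i.e., the indicator of {t >= 0}."""
    return 1 if t >= 0 else 0

def perceptron_matrix(w, b):
    """Q[i, j] = P(Y = j | X = i) for inputs x in {0, 1} and Y = H(w x + b)."""
    Q = np.zeros((2, 2))
    for i, x in enumerate((0, 1)):
        Q[i, heaviside(w * x + b)] = 1.0         # the output is deterministic
    return Q

Q = perceptron_matrix(w=1.0, b=-0.5)             # w > 0 and b in [-w, 0)
h = np.zeros(2)                                  # entries are 0 or 1, so q ln q = 0
capacity_nats = np.log(np.sum(np.exp(np.linalg.pinv(Q) @ h)))
capacity_bits = capacity_nats / np.log(2)
print(Q, np.linalg.det(Q), capacity_bits)
```

With these parameters Q is the identity matrix; choosing $w = -1$,
$b = 0.5$ gives the other nonsingular case, the matrix with ones on the
anti-diagonal, and the same capacity.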


Remark 12.9.3 (i) In the case of a feedforward neural network with noiseless
layers where the input-output mapping, f, is bijective, with $f(x_i) = y_i$
and $m = n$, the input-output matrix is the identity matrix, $Q = I_n$; see
Remark 12.9.1. Then the pseudoinverse is $Q^+ = I_n$ and
$(Q^+ h)_j = h_j = 0$. Then the network capacity is given by (12.9.21) as
$C(w, b) = \ln \sum_{j=1}^{n} e^{0} = \ln n$. Formulas (12.9.19) and (12.9.20)
show that both the output and input distributions which maximize the network
capacity are uniform:

$$p(x_j) = p(y_j) = \frac{1}{n}, \quad 1 \leq j \leq n.$$
















(ii) Capacity can be increased by supplying a larger input information. 
Increasing the number of inputs leads to an increase in the number of weights 
and biases of the network. The capacity depends on the matrix Q, which 
depends on the network activation function and parameters. 

Capacity can be interpreted as the ability of a network to fit a large variety
of target functions. Consequently, a network with a large capacity can lead to
overfitting the training set by memorizing it. On the contrary, a low capacity
network cannot process enough information and thus leads to an underfit of
the training set. The optimal capacity, which is difficult to find
theoretically, is somewhere in between these two limits, being close to the
true complexity of the task to be performed by the network. A problem related
to finding the optimal capacity is the “information bottleneck”, which will be
treated in the next section.

12.10 The Information Bottleneck 

The information bottleneck method is a technique introduced by Naftali
Tishby, Fernando C. Pereira, and William Bialek in [119]. We start by
illustrating the idea using a few suggestive examples.

Example 12.10.1 Assume that the input, X, represents face images and
the target, Z, is the gender of the portrayed people. A face contains a lot
of information, which needs to be squeezed as much as possible through a
bottleneck that preserves the relevant information about the gender. Thus,
the meaningful information about Z might contain features such as hairstyle,
presence of makeup, clothes color, presence of a beard, etc. These features,
which are contained in the initial data X, should also come up in the outcome
Y and represent meaningful information for determining the gender of the
portrayed person.

Example 12.10.2 A speaker has to deliver a talk on a certain topic in a
foreign language in which his vocabulary is limited. He needs to squeeze the
information of the talk, X, through a “bottleneck”, Y, formed by the limited
set of words available, such that the meaningful information about the talk
topic, Z, is not affected.

Example 12.10.3 An employee has to write a narrative about his newly
proposed project. However, his busy boss imposes a 2-page limit per project.
Therefore, the employee’s challenge is to include a lot of information about
the project in only a limited amount of space. The available project
information, X, has to be compressed into a 2-page narrative, Y, the
bottleneck, such that the important project features are not weakened too
much.




We consider a feedforward neural network with input random variable X
and output variable Y, which has to learn the target random variable Z.
We assume that the training distribution, $p(x, z)$, which is the joint
distribution of $(X, Z)$, is provided. The input X is usually a high
dimensional variable, while the target, Z, has a significantly lower
dimension; thus, most of the entropy of X is not very informative about Z,
and the information provided by X has to be compressed as much as possible
into the output Y, such that there is still enough meaningful information
left in Y about the target Z.

Since lossy compression of X into Y cannot convey more information 
about Z than the initial data X, we have the inequality 

$$I(Y, Z) \leq I(X, Z).$$

Namely, the information conveyed about the target Z by the output Y does
not exceed the information conveyed about Z by the initial data X. However,
even if it is less than $I(X, Z)$, the information $I(Y, Z)$ should still be
large enough such that Y still contains enough meaningful information about
Z. Thus, we are looking for a network that keeps a fixed amount of meaningful
information $I(Y, Z)$ about the target Z, while maximizing the compression
of the input X into Y, i.e., minimizing the information $I(X, Y)$. This is
represented in Fig. 12.7. The amount of relevant information captured by the
network is described by the ratio $\rho = I(Y, Z)/I(X, Y)$, with
$\rho \in (0, 1)$.

The bottleneck principle states that we need to pass the information provided
by X about Z through a “bottleneck” formed by the output variable Y, in the
most efficient way. This means minimizing $I(X, Y)$ subject to a given fixed
information $I(Y, Z)$. This can be formalized variationally by minimizing
the functional with constraints

$$\mathcal{L}(p(y|x)) = I(X, Y) - \beta I(Y, Z), \tag{12.10.24}$$

where $\beta$ is a Lagrange multiplier. The first term denotes the compressed
information, while the second term describes the meaningful information.
Therefore, the multiplier $\beta$ describes the trade-off between preserved
meaningful information and compression. A value of $\beta$ close to 0
emphasizes compression over meaningful information, in which case the
representation becomes very sketchy. If $\beta$ tends to $\infty$, then
meaningful information prevails over compression, in which case the
representation becomes very detailed. The argument of the functional is
$p(y|x)$, or equivalently, the input-output matrix.

It is worth noting that this problem is distinct from the problem of finding
the network capacity, where $I(X, Y)$ is maximized over all input patterns
$p(x)$; here the information $I(X, Y)$ is minimized over $p(y|x)$, while the
input distribution, $p(x)$, is kept fixed.
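To see the trade-off in (12.10.24) on a small case, the functional can be
evaluated for two candidate encoders $p(y|x)$. The sketch below assumes that
Y depends on Z only through X, so that
$p(y, z) = \sum_x p(y|x) p(x, z)$; the training table `p_xz`, the two
encoders, and the function names are hypothetical.

```python
import numpy as np

def mutual_info(joint):
    """Mutual information in nats of a joint probability table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0                             # treat 0 * ln 0 as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

def bottleneck_functional(p_xz, encoder, beta):
    """I(X,Y) - beta * I(Y,Z) for encoder[i, j] = p(y_j | x_i)."""
    p_x = p_xz.sum(axis=1)
    p_xy = p_x[:, None] * encoder                # p(x, y) = p(x) p(y | x)
    p_yz = encoder.T @ p_xz                      # p(y, z) = sum_x p(y | x) p(x, z)
    return mutual_info(p_xy) - beta * mutual_info(p_yz)

# Hypothetical training distribution over 4 inputs and 2 targets.
p_xz = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])
merge = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # 4 -> 2 codes
copy = np.eye(4)                                                     # no compression

for beta in (0.5, 5.0):
    print(beta, bottleneck_functional(p_xz, merge, beta),
          bottleneck_functional(p_xz, copy, beta))
```

In this toy case the merging encoder keeps all the information about Z that
X carries while halving $I(X, Y)$, so it gives the smaller value of the
functional for every $\beta$.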




12.10.1 An exact formal solution 

Assume the input distribution, $p(x)$, and the training distribution,
$p(x, z)$, are given (and hence, the posterior distribution,
$p(z|x) = p(x, z)/p(x)$, is given). The unknown distributions are $p(y)$,
$p(y|x)$ and $p(z|y)$. The distribution $p(y|x)$ is called the encoder, while
the distribution $p(z|y)$ is called the decoder. We need to find the output
distribution as well as the encoder and decoder distributions such that the
squeezing through the bottleneck described before becomes optimal. The
nonlinear variational problem (12.10.24) has an exact implicit solution,
which expresses $p(y|x)$, $p(z|y)$, and $p(y)$ in terms of themselves and
also in terms of the known distributions $p(x)$ and $p(z|x)$ (prior and
posterior distributions).

Theorem 12.10.1 The optimal solution that minimizes functional (12.10.24)
satisfies the following implicit equations:

$$p(y|x) = \frac{p(y)}{Z(x, \beta)}\, e^{-\beta D_{KL}(p(z|x) \| p(z|y))} \tag{12.10.25}$$

$$p(z|y) = \frac{1}{p(y)} \sum_{x} p(z|x) p(y|x) p(x), \tag{12.10.26}$$

$$p(y) = \sum_{x} p(y|x) p(x), \tag{12.10.27}$$

where

$$D_{KL}(p(z|x) \| p(z|y)) = \sum_{z} p(z|x) \ln \frac{p(z|x)}{p(z|y)},$$

and $Z(x, \beta)$ is the normalization function.
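Since equations (12.10.25)-(12.10.27) express the unknown distributions in
terms of themselves, one natural way to explore them numerically is to apply
them repeatedly, starting from a random encoder. The sketch below does this
as a plain fixed-point iteration; it is an illustration under assumptions
(strictly positive $p(x, z)$, a fixed number of steps, no convergence check),
and the function name `ib_iterate` and the example table are hypothetical.

```python
import numpy as np

def ib_iterate(p_xz, n_codes, beta, n_steps=200, seed=0):
    """Repeatedly apply (12.10.25)-(12.10.27); returns the encoder p(y | x)."""
    rng = np.random.default_rng(seed)
    p_x = p_xz.sum(axis=1)
    p_z_given_x = p_xz / p_x[:, None]
    enc = rng.random((len(p_x), n_codes))
    enc /= enc.sum(axis=1, keepdims=True)            # random initial p(y | x)
    for _ in range(n_steps):
        p_y = enc.T @ p_x                            # (12.10.27)
        p_z_given_y = (enc.T @ p_xz) / p_y[:, None]  # (12.10.26)
        # D_KL(p(z|x) || p(z|y)) for every pair (x, y)
        log_ratio = np.log(p_z_given_x[:, None, :] / p_z_given_y[None, :, :])
        kl = np.sum(p_z_given_x[:, None, :] * log_ratio, axis=2)
        enc = p_y[None, :] * np.exp(-beta * kl)      # (12.10.25), unnormalized
        enc /= enc.sum(axis=1, keepdims=True)        # division by Z(x, beta)
    return enc

# Hypothetical strictly positive training distribution p(x, z).
p_xz = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])
encoder = ib_iterate(p_xz, n_codes=2, beta=5.0)
print(np.round(encoder, 3))
```

Because the update depends on x only through $p(z|x)$, inputs with the same
posterior receive identical encoder rows after the first step.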


Proof: We note that two of the unknown distributions, $p(y)$ and $p(z|y)$,
can be written in terms of $p(x)$, $p(z|x)$ and $p(y|x)$ as follows:

$$p(y) = \sum_{x} p(y|x) p(x),$$

$$p(z|y) = \sum_{x} p(z|x) p(x|y) = \frac{1}{p(y)} \sum_{x} p(z|x) p(y|x) p(x),$$

where in the last equation we have used that

$$p(x|y) = \frac{p(x, y)}{p(y)} = \frac{p(y|x) p(x)}{p(y)}.$$

Therefore, we need to find a formula for $p(y|x)$ in terms of the previous
distributions.








396 


Deep Learning Architectures, A Mathematical Approach 


For fixed $y$ and $z$, the law of conditional probabilities provides

$$p(y) = \sum_i p(y|x_i)\, p(x_i) = \sum_x p(y|x)\, p(x)$$

$$p(y|z) = \sum_i p(y|x_i)\, p(x_i|z) = \sum_x p(y|x)\, p(x|z)$$

and hence for a given $x$ we have the following partial derivatives:

$$\frac{\partial p(y)}{\partial p(y|x)} = p(x) \tag{12.10.28}$$

$$\frac{\partial p(y|z)}{\partial p(y|x)} = p(x|z). \tag{12.10.29}$$

When optimizing over $p(y|x)$, we need to add the constraint $\sum_y p(y|x) = 1$,
for any given $x$. If the Lagrange multiplier for the previous constraint is
$\lambda(x)$ (it is $x$-dependent), then we obtain the following functional with
constraints:

$$\mathcal{L}(p(y|x)) = I(X,Y) - \beta I(Y,Z) - \sum_x \lambda(x) \Big( \sum_y p(y|x) - 1 \Big). \tag{12.10.30}$$


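For fixed values of $x$ and $y$, we have the variational equation

$$\frac{\partial \mathcal{L}}{\partial p(y|x)} = \frac{\partial I(X,Y)}{\partial p(y|x)} - \beta\, \frac{\partial I(Y,Z)}{\partial p(y|x)} - \lambda(x). \tag{12.10.31}$$

Using that

$$\begin{aligned}
I(X,Y) &= H(Y) - H(Y|X) \\
&= -\sum_y p(y) \ln p(y) + \sum_{x,y} p(x,y) \ln p(y|x) \\
&= -\sum_y p(y) \ln p(y) + \sum_{x,y} p(y|x)\, p(x) \ln p(y|x),
\end{aligned}$$

the chain rule together with (12.10.28) yields

$$\begin{aligned}
\frac{\partial I(X,Y)}{\partial p(y|x)} &= \frac{\partial H(Y)}{\partial p(y)}\, \frac{\partial p(y)}{\partial p(y|x)} - \frac{\partial H(Y|X)}{\partial p(y|x)} \\
&= -(1 + \ln p(y))\, p(x) + p(x)(1 + \ln p(y|x)) \\
&= p(x) \ln \frac{p(y|x)}{p(y)}.
\end{aligned}$$

Similarly, writing

$$I(Y,Z) = H(Y) - H(Y|Z) = -\sum_y p(y) \ln p(y) + \sum_{y,z} p(y|z)\, p(z) \ln p(y|z),$$

and applying the chain rule together with (12.10.29), we have

$$\begin{aligned}
\frac{\partial I(Y,Z)}{\partial p(y|x)} &= -(1 + \ln p(y))\, p(x) + \sum_z p(z)(1 + \ln p(y|z))\, p(x|z) \\
&= -p(x) - p(x) \ln p(y) + \sum_z p(z)\, p(x|z) + \sum_z p(z)\, p(x|z) \ln p(y|z) \\
&= -p(x) \ln p(y) + \sum_z p(x,z) \ln p(y|z) \\
&= -\sum_z p(x,z) \ln p(y) + \sum_z p(x,z) \ln p(y|z) \\
&= \sum_z p(x,z) \ln \frac{p(y|z)}{p(y)} = p(x) \sum_z p(z|x) \ln \frac{p(y|z)}{p(y)} \\
&= p(x) \sum_z p(z|x) \ln \frac{p(z|y)}{p(z)}.
\end{aligned}$$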


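Substituting into (12.10.31) yields

$$\frac{\partial \mathcal{L}}{\partial p(y|x)} = p(x) \Big\{ \ln \frac{p(y|x)}{p(y)} - \beta \sum_z p(z|x) \ln \frac{p(z|y)}{p(z)} - \frac{\lambda(x)}{p(x)} \Big\}.$$

The term $\beta \sum_z p(z|x) \ln \frac{p(z|y)}{p(z)}$ depends on both $y$ and $x$. We
shall write it as a combination of two Kullback-Leibler divergences as

$$\begin{aligned}
\beta \sum_z p(z|x) \ln \frac{p(z|y)}{p(z)} &= \beta \sum_z p(z|x) \ln \Big( \frac{p(z|y)}{p(z|x)} \cdot \frac{p(z|x)}{p(z)} \Big) \\
&= \beta \sum_z p(z|x) \ln \frac{p(z|y)}{p(z|x)} + \beta \sum_z p(z|x) \ln \frac{p(z|x)}{p(z)} \\
&= -\beta D_{KL}(p(z|x)\,\|\,p(z|y)) + \beta \phi(x),
\end{aligned}$$

where

$$\phi(x) = \sum_z p(z|x) \ln \frac{p(z|x)}{p(z)} = D_{KL}(p(z|x)\,\|\,p(z))$$

is a function of $x$ only, which will be absorbed into the multiplier.
Substituting in the previous partial derivative yields

$$\frac{\partial \mathcal{L}}{\partial p(y|x)} = p(x) \Big\{ \ln \frac{p(y|x)}{p(y)} + \beta D_{KL}(p(z|x)\,\|\,p(z|y)) - \beta \phi(x) - \frac{\lambda(x)}{p(x)} \Big\}.$$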

























Introducing the new multiplier

$$\tilde{\lambda}(x) = \beta \phi(x) + \frac{\lambda(x)}{p(x)},$$

we can finally write the derivative in the following simpler way:

$$\frac{\partial \mathcal{L}}{\partial p(y|x)} = p(x) \Big\{ \ln \frac{p(y|x)}{p(y)} + \beta D_{KL}(p(z|x)\,\|\,p(z|y)) - \tilde{\lambda}(x) \Big\}.$$

Therefore, the equation $\frac{\partial \mathcal{L}}{\partial p(y|x)} = 0$ is satisfied by the solution

$$p(y|x) = p(y)\, e^{\tilde{\lambda}(x)}\, e^{-\beta D_{KL}(p(z|x)\,\|\,p(z|y))} = \frac{p(y)}{Z(\beta,x)}\, e^{-\beta D_{KL}(p(z|x)\,\|\,p(z|y))},$$

where $Z(\beta,x) = e^{-\tilde{\lambda}(x)}$ is the normalization function. We skip the
proof of the fact that this solution minimizes the given functional; for this
the reader is directed to the paper [119]. ■


Remark 12.10.2 The implicit system (12.10.25)-(12.10.27) can be solved
using an iterative algorithm. Denoting the iteration step by $k$, we consider
the iterative system

$$p_k(y|x) = \frac{p_k(y)}{Z_k(x,\beta)}\, e^{-\beta D_{KL}(p(z|x)\,\|\,p_k(z|y))}$$

$$p_{k+1}(z|y) = \frac{1}{p_{k+1}(y)} \sum_x p(z|x)\, p_k(y|x)\, p(x)$$

$$p_{k+1}(y) = \sum_x p_k(y|x)\, p(x).$$

It can be shown that the sequence of solutions is convergent as $k \to \infty$ to
the desired solution.
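
The iteration of Remark 12.10.2 can be tried out on a small discrete example. The following is only a minimal sketch, not the implementation of [119]: the function names, the starting encoder, the number of bottleneck states `n_y`, the value of $\beta$, and the toy distributions are all assumptions made for illustration, and the code presumes that $p(z|x)$ is strictly positive and that no bottleneck state loses all its probability mass.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bottleneck_iteration(p_x, p_z_given_x, n_y, beta, steps=200):
    """Iterate equations (12.10.25)-(12.10.27) from a fixed starting encoder.

    p_x[i] = p(x_i), p_z_given_x[i][k] = p(z_k | x_i); n_y is the number of
    bottleneck states (a free choice of this sketch). Assumes p(z|x) > 0.
    """
    n_x, n_z = len(p_x), len(p_z_given_x[0])
    # Hypothetical non-uniform starting encoder p_0(y|x); each row sums to 1.
    encoder = []
    for i in range(n_x):
        row = [1.0 + (i + j) % n_y for j in range(n_y)]
        encoder.append([r / sum(row) for r in row])
    for _ in range(steps):
        # (12.10.27): p(y) = sum_x p(y|x) p(x)
        p_y = [sum(encoder[i][j] * p_x[i] for i in range(n_x)) for j in range(n_y)]
        # (12.10.26): p(z|y) = (1 / p(y)) sum_x p(z|x) p(y|x) p(x)
        decoder = [
            [sum(p_z_given_x[i][k] * encoder[i][j] * p_x[i] for i in range(n_x)) / p_y[j]
             for k in range(n_z)]
            for j in range(n_y)
        ]
        # (12.10.25): p(y|x) = p(y) exp(-beta D_KL(p(z|x) || p(z|y))) / Z(x, beta)
        new_encoder = []
        for i in range(n_x):
            row = [p_y[j] * math.exp(-beta * kl_divergence(p_z_given_x[i], decoder[j]))
                   for j in range(n_y)]
            z_norm = sum(row)  # normalization function Z(x, beta)
            new_encoder.append([r / z_norm for r in row])
        encoder = new_encoder
    # decoder and p_y are those computed from the encoder of the previous step.
    return encoder, decoder, p_y


# Toy data: four inputs, two target classes, two bottleneck states.
p_x = [0.25, 0.25, 0.25, 0.25]
p_z_given_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
encoder, decoder, p_y = bottleneck_iteration(p_x, p_z_given_x, n_y=2, beta=5.0)
for i, row in enumerate(encoder):
    print(f"p(y | x_{i}) = {[round(v, 3) for v in row]}")
```

In this toy setting the inputs whose posteriors $p(z|x)$ favor different targets end up encoded by different bottleneck states; smaller values of $\beta$ favor compression over this separation.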


12.10.2 The information plane 

This section will provide a geometric description for the evolution of the
information through the layers of a deep feedforward neural network. This
way, the shape of the associated curve will tell how the information flow
behaves.

We recall a few familiar notations. As usual, let $X = X^{(0)}$ and $Y = X^{(L)}$
be the input and output layers of a deep feedforward neural network. Denote
by $Z$ the target variable and by $X^{(\ell)}$ the activation of the $\ell$th layer of the
network. Each layer, $X^{(\ell)}$, can be associated with two nonnegative numbers,

$$I_X^{(\ell)} = I(X^{(\ell)}, X), \qquad I_Y^{(\ell)} = I(X^{(\ell)}, Y),$$

which can further be considered as coordinates in a Cartesian plane, called
the information plane. Thus, each layer of the network can be mapped into
a point in the first quadrant of this plane. We shall investigate the shape
of the curve made with these points.

Figure 12.8: The information path associated with a neural network in the
information plane.

The data processing inequalities, see Proposition 12.7.9, provide a double
sequence of inequalities

$$H(X) \ge I(X^{(1)}, X) \ge \cdots \ge I(X^{(\ell)}, X) \ge I(X^{(\ell+1)}, X) \ge \cdots \ge I(X, Y)$$

$$I(X, Y) \le I(X^{(1)}, Y) \le \cdots \le I(X^{(\ell)}, Y) \le I(X^{(\ell+1)}, Y) \le \cdots \le I(X^{(L-1)}, Y),$$

which state that the sequence $I_X^{(\ell)}$ is decreasing, while $I_Y^{(\ell)}$ is increasing
with respect to $\ell$:

$$I_X^{(\ell)} \ge I_X^{(\ell+1)}, \qquad I_Y^{(\ell)} \le I_Y^{(\ell+1)}.$$

Connecting the points $(I_X^{(\ell)}, I_Y^{(\ell)})$ with a monotonic continuous curve, we
obtain the information path associated with the network, see Fig. 12.8.

It is worth noting that during the learning process, while the weights and
biases are tuned, the information path deforms in the information plane.
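
To make the coordinates of the information plane concrete, here is a small sketch that computes the points $(I_X^{(\ell)}, I_Y^{(\ell)})$ for a toy deterministic "network". Everything in it is an assumption chosen for illustration: the input is uniform on $\{0, \dots, 7\}$, and the two layers are the hypothetical maps $x \mapsto \lfloor x/2 \rfloor$ and $x \mapsto \lfloor x/4 \rfloor$. Mutual information is evaluated from the identity $I(A,B) = H(A) + H(B) - H(A,B)$.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution of `samples`."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(a, b):
    """I(A, B) = H(A) + H(B) - H(A, B) for paired samples, in bits."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


xs = list(range(8))                                  # X uniform on {0, ..., 7}
layers = [xs, [x // 2 for x in xs], [x // 4 for x in xs]]
y = layers[-1]                                       # output layer Y = X^(L)

# One point (I_X, I_Y) of the information plane per layer.
path = [(mutual_information(layer, xs), mutual_information(layer, y))
        for layer in layers]
for ell, (i_x, i_y) in enumerate(path):
    print(f"layer {ell}: I_X = {i_x:.3f} bits, I_Y = {i_y:.3f} bits")
```

Here the first coordinate decreases from 3 bits to 1 bit along the layers, while the second one stays constant, so both chains of inequalities hold, the second one with equality.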













Figure 12.9: Regions of different types of compression on the information 
curve. 

Another remark is that in the case of a network with compressionless layers
the first sequence of inequalities above becomes a sequence of identities,
and hence the information path takes a vertical position. The reciprocal of
the slope describes the compression rate along the network, see Fig. 12.9.

We note that for any feedforward network we have $I_X^{(L)} = I_Y^{(0)} = I(X, Y)$.
If, in addition, we also have $I_X^{(L-p)} = I_Y^{(p)}$ for all $1 \le p \le L-1$, then the
information curve is symmetric with respect to the line $I_Y = I_X$. This curve
would correspond, for instance, to a symmetric autoencoder network.

Although the geometry of the information path provides relevant information
about the network, it does not determine the network architecture uniquely.
This can be easily seen if one of the layers is modified by an invertible
transformation, which changes the network architecture but leaves the mutual
information invariant, see Proposition 12.7.5.

Remark 12.10.3 The bottleneck principle can be applied repeatedly, layer by
layer. Applying the bottleneck principle at each layer, we would like to
minimize the information $I(X^{(\ell)}, X^{(\ell+1)})$ while keeping $I(X^{(\ell+1)}, Z)$ as large
as possible. This is to compress the information between the $\ell$th and
$(\ell+1)$th layers, while keeping enough meaningful information on $Z$.

12.11 Information Processing with MNIST 

Assessing the efficiency of a neural network customarily starts with a test on
the MNIST data. This is a database consisting of gray-scaled handwritten
digits from 0 to 9, together with a label indicating the intended integer. Each
handwritten digit is represented by a 28 x 28 pixel image. The labels are
represented by "one-hot vectors", which are vectors of length 10 having a 1
placed on the slot with the same index as the corresponding digit, and zeros
elsewhere.² The MNIST data is divided into 55,000 training data, 10,000 testing
data, and 5,000 validation data. A network is trained using the training data
set and then is tested on the testing data, and a percentage of success is
recorded, as the ratio between the number of correctly classified test samples
and the total number of samples, which we shall call accuracy. The goal is to
achieve an accuracy as close as possible to 100% on the test data, which
corresponds to a perfect classification of the test data.

Figure 12.10: A zero-hidden layer feedforward network using a batch size of
30, the softmax activation function, and the gradient descent method with
learning rate 0.03, producing a testing accuracy of 92.5%. The network
uses a cost function given by the sum of squares. The diagram was generated
with Tensorboard.
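
The label encoding and the accuracy measure described above can be sketched in a few lines. The helper names `one_hot` and `accuracy` are hypothetical, introduced only for this illustration, and the predicted digits in the usage lines are made up.

```python
def one_hot(digit, num_classes=10):
    """Return the one-hot vector of length `num_classes` for a digit label."""
    vec = [0] * num_classes
    vec[digit] = 1
    return vec


def accuracy(predicted, labels):
    """Ratio between the number of correctly classified samples and all samples."""
    correct = sum(p == t for p, t in zip(predicted, labels))
    return correct / len(labels)


print(one_hot(3))                               # 1 in slot 3, zeros elsewhere
print(accuracy([7, 2, 1, 0], [7, 2, 1, 4]))     # 3 correct out of 4
```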

Experimental work has shown that the accuracy of a neural network
depends on a variety of factors, such as: number of hidden layers (network
depth) and their size (network width), activation function used (logistic,
ReLU, etc.), cost function (sum of square errors, cross-entropy, etc.), batch
size, learning step, method of minimization (gradient descent, etc.), as well
as the type of the network (fully-connected, convolutional network).
Furthermore, we can state empirically some bounds for the accuracy as follows:
a two-layer feedforward neural network (FNN) has a maximum of about 92.5%
accuracy on MNIST data, see Fig. 12.10; a three-layer FNN cannot do better
than 98% accuracy on MNIST data, see Fig. 12.11; a convolutional network
can exceed 98% accuracy on MNIST data.

The aim of the next section is to address this phenomenon mathematically.

Figure 12.11: A one-hidden layer feedforward network, with 300 neurons in
the hidden layer, trained with a batch size of 40, using the ReLU and the
softmax activation functions for the hidden and output layer, respectively,
produces a testing accuracy of 97.6%. The learning uses the ADAM method with
learning rate starting at 0.0015. The cost function used is the
cross-entropy. The diagram was generated with Tensorboard.

² For instance, the labels 0 and 3 are represented by the one-hot vectors
(1, 0, 0, 0, 0, 0, 0, 0, 0, 0) and (0, 0, 0, 1, 0, 0, 0, 0, 0, 0), respectively.


12.11.1 A two-layer FNN 

We shall flatten each 28 x 28 image into a vector, $X$, of length 784, following
the idea of Fig. 5.2. This is considered as a random variable with values in
$[0, 1]^{784}$. A component with a value of $X_i = 1$ corresponds to a black pixel,
while $X_i = 0$ corresponds to a white pixel. Any other value in between
corresponds to a gray pixel.

The network has no hidden layers, so there are only two layers, one
corresponding to the input and the other to the output.

The network output is a decision function with 10 components, which is
described by the vector $Y = (Y_1, \dots, Y_{10})$. Assume the activation function $\phi$
is invertible.³ Then the output is described by the formula

$$Y_j = \phi \Big( \sum_{i=1}^{784} w_{ij} X_i + b_j \Big), \qquad 1 \le j \le 10,$$

or, in an equivalent matrix form, $Y = \phi(W^T X + b)$. We used the notation
$W = (w_{ij}) \in \mathbb{R}^{784 \times 10}$ for the weight matrix and $b = (b_j) \in \mathbb{R}^{10}$ for the
bias vector.

³ Here we assume, for instance, that the activation function is of sigmoid type. There are
activation functions which are not invertible, such as ReLU, or softmax, which is customarily
used in classification. However, softplus, which approximates ReLU, is invertible.

Since $Y$ depends on $X$, the field of information defined by $Y$ is included
in the field of information defined by $X$, namely,

$$\mathfrak{S}(Y) \subset \mathfrak{S}(X).$$

This is a strict inclusion, namely, the inverse inclusion, $\mathfrak{S}(X) \subset \mathfrak{S}(Y)$,
does not hold, as we shall show next. Denoting $T = \phi^{-1}(Y) - b$, we have
$W^T X = T$, where $W^T$ is the transpose of the matrix $W$. Variables $Y$ and
$T$ define the same information field, $\mathfrak{S}(Y) = \mathfrak{S}(T)$, since $T$ can be obtained
from $Y$ by an invertible transformation. The aforementioned system can be
written explicitly as

$$\begin{aligned}
w_{1,1} X_1 + w_{2,1} X_2 + \cdots + w_{784,1} X_{784} &= T_1 \\
&\;\;\vdots \\
w_{1,10} X_1 + w_{2,10} X_2 + \cdots + w_{784,10} X_{784} &= T_{10}.
\end{aligned}$$

Since $\mathrm{rank}(W) \le \min\{10, 784\}$, without losing generality, we shall consider
the best information transfer scenario when the rank is maximal,
$\mathrm{rank}(W) = 10$. Selecting ten linearly independent columns of $W^T$ and
reindexing the components of $X$ if necessary, we write

$$\begin{aligned}
w_{1,1} X_1 + w_{2,1} X_2 + \cdots + w_{10,1} X_{10} &= T_1 - \sum_{i=11}^{784} w_{i,1} X_i \\
&\;\;\vdots \\
w_{1,10} X_1 + w_{2,10} X_2 + \cdots + w_{10,10} X_{10} &= T_{10} - \sum_{i=11}^{784} w_{i,10} X_i.
\end{aligned}$$

Since the coefficient matrix is invertible, this implies that the first ten
components of $X$ can be written in terms of $T$ (and hence, in terms of $Y$) and
the other components of $X$. Consequently,

$$\mathfrak{S}(X_1, \dots, X_{10}) \subset \mathfrak{S}(T, X_{11}, \dots, X_{784}) = \mathfrak{S}(Y, X_{11}, \dots, X_{784}).$$

Obviously,

$$\mathfrak{S}(X_{11}, \dots, X_{784}) \subset \mathfrak{S}(Y, X_{11}, \dots, X_{784}).$$

Then the last two relations imply

$$\mathfrak{S}(X) = \mathfrak{S}(X_1, \dots, X_{10}) \vee \mathfrak{S}(X_{11}, \dots, X_{784}) \subset \mathfrak{S}(Y, X_{11}, \dots, X_{784}). \tag{12.11.32}$$

On the other side, we know that

$$\mathfrak{S}(Y) \subset \mathfrak{S}(X)$$

and it is obvious that

$$\mathfrak{S}(X_{11}, \dots, X_{784}) \subset \mathfrak{S}(X).$$

From the last two relations we infer that

$$\mathfrak{S}(Y) \vee \mathfrak{S}(X_{11}, \dots, X_{784}) \subset \mathfrak{S}(X),$$

or, after taking the $\mathfrak{S}$ operator,

$$\mathfrak{S}(Y, X_{11}, \dots, X_{784}) \subset \mathfrak{S}(X). \tag{12.11.33}$$

From (12.11.32) and (12.11.33) it follows that

$$\mathfrak{S}(X) = \mathfrak{S}(Y, X_{11}, \dots, X_{784}). \tag{12.11.34}$$

To conclude, the information carried by the output variable, $Y$, is strictly
smaller than the input information defined by $X$. This corresponds to a
compression of information, during which the information generated by
$X_{11}, \dots, X_{784}$ is lost.

A similar approach works for the case when the matrix $W$ has a smaller
rank. In this general case, a number of $784 - \mathrm{rank}(W)$ components of $X$ need
to be included in the right term of (12.11.34).
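
The key step of the argument, recovering the leading components of $X$ from $T$ and the remaining components, can be checked on a scaled-down instance. The sketch below is an illustration under stated assumptions: it uses a hypothetical $2 \times 4$ matrix $W^T$ in place of the $10 \times 784$ one, a made-up binary input, and a small exact Gauss-Jordan solver (`solve_square`) written for this example.

```python
from fractions import Fraction


def solve_square(a, rhs):
    """Solve a x = rhs for an invertible square matrix a (Gauss-Jordan, exact)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]


# Hypothetical small instance: 2 outputs and 4 inputs instead of 10 and 784.
w_t = [[1, 2, 0, 1],
       [3, 1, 1, 0]]            # W^T, of maximal rank 2
x = [1, 0, 1, 1]                # a binary "image"
t = [sum(w * xi for w, xi in zip(row, x)) for row in w_t]   # T = W^T X

k = len(w_t)                    # number of components that can be recovered
lead = [row[:k] for row in w_t]                 # invertible k x k leading block
# Move the known components X_{k+1}, ... to the right-hand side.
rhs = [t_j - sum(w * xi for w, xi in zip(row[k:], x[k:]))
       for row, t_j in zip(w_t, t)]
recovered = solve_square(lead, rhs)
print(recovered == x[:k])       # the leading components are determined
```

The remaining components $X_3, X_4$ of this toy example cannot be recovered from $T$ alone, which is the analogue of the information lost through $X_{11}, \dots, X_{784}$.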

In the following we shall provide a quantitative analysis of information
through the network along the lines of section 12.8.5, which is of practical
interest for comparing the information processing ability of neural networks.

For the sake of simplicity, we further assume that each pixel is allowed
to take only two values, black or white. The maximum entropy of the input
variable $X$ satisfies $2^{H(X)} = 784$, namely, the input contains an information
of at most 9.614 bits (per image). The entropy of $X$, given $Y$, is determined
by the average number of pixels corresponding to each output,

$$2^{H(X|Y)} = \frac{784}{10},$$

which provides $H(X|Y) = 6.292$ bits. The mutual information of $X$ and $Y$,
i.e., the number of bits contained in $X$ about $Y$, is given by

$$I(X, Y) = H(X) - H(X|Y) = 9.614 - 6.292 = 3.322,$$

namely, about 3.32 bits of information of each image are used toward the
classification of the picture content. The number of classes provided by $Y$ is
given by

$$2^{I(X,Y)} = 2^{3.322} = 10,$$

as expected.

Writing now the mutual information as

$$I(X, Y) = H(Y) - H(Y|X),$$

and using that there is no uncertainty in $Y$ if $X$ is given, i.e., $H(Y|X) = 0$
($Y$ depends deterministically on $X$), it follows that $H(Y) = I(X, Y) = 3.322$
bits. This verifies the relation $2^{H(Y)} = 10$, which recovers the known fact that
the number of elements of $Y$ is 10.

To conclude, the input and output entropies are $H(X) = 9.614$ bits and
$H(Y) = 3.322$ bits, respectively. The information loss due to compression is
6.292 bits (per image). If this is viewed in the light of formula (12.11.33), then
it corresponds to the information lost through the components $X_{11}, \dots, X_{784}$.
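
The arithmetic of this entropy bookkeeping is easy to reproduce. The short sketch below simply evaluates the base-2 logarithms used above; the variable names are chosen for this illustration only. The computed values agree with the figures quoted in the text up to the last displayed decimal.

```python
import math

h_x = math.log2(784)                 # H(X), from 2^H(X) = 784
h_x_given_y = math.log2(784 / 10)    # H(X|Y), from 2^H(X|Y) = 784/10
i_xy = h_x - h_x_given_y             # I(X, Y) = H(X) - H(X|Y)

print(f"H(X)   = {h_x:.4f} bits")
print(f"H(X|Y) = {h_x_given_y:.4f} bits")
print(f"I(X,Y) = {i_xy:.4f} bits, 2^I(X,Y) = {2 ** i_xy:.4f}")
```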


12.11.2 A three-layer FNN 

Consider a three-layer FNN, with input $X$, output $Y$, and an extra hidden
layer, $U$, containing 100 neurons. In this case two compressions occur: one
from $X$ to $U$, under the ratio 784 : 100; and another, from $U$ to $Y$, under
the ratio 100 : 10. Experiments have shown that the accuracy of the network
increases from 92% (which corresponds to a zero-hidden layer network)
to about 98%, when one hidden layer is used. The customary explanation
of this fact is that the hidden layer is able to collect more features of the
input variable, adding some extra capacity to the network. We shall provide
a mathematical formalization of this fact.

Let $Y$ and $\tilde{Y}$ be the outputs for a zero-hidden layer and a one-hidden
layer network, respectively. The output of the one-hidden layer network is

$$\tilde{Y} = \phi\big( \tilde{W}^T \phi(\hat{W}^T X + \hat{b}) + \tilde{b} \big),$$

where $(\hat{W}, \hat{b})$ and $(\tilde{W}, \tilde{b})$ are the systems of weights and biases for the
first-to-second layer and for the second-to-third layer, respectively. The
output $\tilde{Y}$ can be represented as a point in a space of dimension 79,510,
parametrized by

$$(\hat{W}, \tilde{W}, \hat{b}, \tilde{b}) \in \mathbb{R}^{784 \times 100} \times \mathbb{R}^{100 \times 10} \times \mathbb{R}^{100} \times \mathbb{R}^{10}.$$

Similarly, the output $Y = \phi(W^T X + b)$ of the zero-hidden layer neural network
can be represented as a point in a space of dimension 7,850, parametrized by

$$(W, b) \in \mathbb{R}^{784 \times 10} \times \mathbb{R}^{10}.$$

When the neural networks are optimized, their outcomes correspond to
projections of the target variable $Z$ onto the aforementioned spaces.
Consequently, a larger dimension will produce a better approximation of the
target variable $Z$ by projections, which means the network accuracy tends to be
higher in the case when we introduce a hidden layer.
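
The two dimensions quoted above follow from counting one weight per connection and one bias per non-input neuron. A minimal sketch of this count, with a hypothetical helper `num_parameters` that takes the list of layer sizes of a fully-connected network:

```python
def num_parameters(layer_sizes):
    """Count the weights and biases of a fully-connected feedforward network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(num_parameters([784, 10]))        # zero-hidden layer network
print(num_parameters([784, 100, 10]))   # one hidden layer with 100 neurons
```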


12.11.3 The role of convolutional nets 

The low performance of a fully-connected layer feedforward neural network 
used in the classification of the MNIST data is due to two kinds of information 
losses: 

(i) One is due to the low capacity of the two-layer network. This can be fixed
by adding more layers or more neurons in the hidden layer to increase the
network capacity. However, there is an upper bound of about 98% for the
network accuracy in this case, which cannot be exceeded, regardless of how
wide the hidden layer is, or how many hidden layers are added to the network.

(ii) To gain the missing 2% we need to acknowledge another information loss
in the input data, due to flattening out the image. This removes some of the
2-dimensional information, such as the relation of a pixel with its neighboring
pixels. Hence, a neural network with an architecture which takes advantage of
the 2-dimensional data structure is needed, and this is the convolutional neural
network (CNN). Chapter 16 will discuss this type of network in more detail.

How much information is ignored when the 2-dimensional input data is
flattened out into a vector in the case of the MNIST data? One way to
attempt to answer this question is to assess the information provided by the
surprise of misclassifying digits. This information can be assessed using the
log-likelihood function.⁴
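
As a small numerical illustration of this surprise measure, the sketch below evaluates $-\log_2 p$ for a few accuracy levels $p$. The function name `misclassification_information` is an assumption made for this example.

```python
import math


def misclassification_information(accuracy):
    """Surprise -log2(p), in bits, attached to a classification accuracy p."""
    return -math.log2(accuracy)


for p in (1.0, 0.5, 0.92, 0.98):
    bits = misclassification_information(p)
    print(f"accuracy {p:.2f}: {bits:.3f} bits")
```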

⁴ The information contained in an event is large when the element of surprise brought
by that event is also large. For instance, if $p$ is the probability of snow, then $-\ln p$ is the
information carried by this event. During wintertime the probability of snow can be
$p = 0.8$, which corresponds to an information of $-\ln 0.8 = 0.22$. During the summer the
probability of snow is extremely small, let's say $p = 0.001$. This means an information of
$-\ln(0.001) = 6.9$.

Figure 12.12: (a) The information is given only by the horizontal component:
$H(X_h, X_v) = H(X_h)$; (b) The information has both horizontal and vertical
components: $H(X_h, X_v) = H(X_h) + H(X_v) - I(X_h, X_v)$.

Let $p$ denote the probability of classifying the MNIST data correctly. If
the accuracy is 100%, i.e., the network classifies all digits correctly, then
$p = 1$, so the log-likelihood is $-\log_2 p = 0$, i.e., there is no misclassification
information in this case. If the accuracy were 50%, the network would
classify correctly one out of two digits, on average. Since
$-\log_2(1/2) = 1$, there is 1 bit of information characterizing the
misclassification. But if the classification accuracy is 92%, then 92 digits out
of 100 are classified correctly. This corresponds to a misclassification
information of $-\log_2(0.92) = 0.12$ bits. In the case of 98% accuracy, the
misclassification information is $-\log_2(0.98) = 0.029$ bits. The conclusion is that
after flattening the 28 x 28 pixel data into a vector of length 784, the most
efficient FNN will still ignore about 0.029 bits per MNIST image. This loss is
due to the change in topology of the figure.

In the following we shall explain this information loss using entropies. We
need to introduce first the concept of horizontal and vertical reading variables.
When one reads an English text, the information is extracted horizontally,
line by line, from left to right. If the text is in Chinese, then the information
is extracted vertically, column by column, from top to bottom.







































However, when one looks at a picture of a human face, the information
cannot be considered only horizontally or vertically, since the relative
position of each part with respect to all the other parts plays a role in
building the information as a whole.

We shall denote by $X_h$ and $X_v$ two random variables denoting the horizontal
and the vertical reading of an image, respectively. (For instance, when
reading a doc file, we read a sequence of characters, which is considered as
a random variable, $X_h$.)

When reading an English text, the observed information is contained
in the reading variable $X_h$ and is equal to its entropy, $H(X_h)$. In this
case the components $X_h$ and $X_v$ are independent, with $H(X_v) = 0$, i.e., no
information is provided by the vertical component, see Fig. 12.12 a.

On the other hand, when reading a Chinese text, the information is provided
by $H(X_v)$. The components $X_h$ and $X_v$ are still independent and $H(X_h) = 0$,
i.e., the horizontal component does not affect the information in this case.

Since looking at a 2-dimensional image, like an MNIST image, requires
a 2-dimensional awareness for the neurons, this produces an information
$H(X_h, X_v)$, which is the joint entropy of the variables $(X_h, X_v)$, see
Fig. 12.12 b. The components $X_h$ and $X_v$ are not necessarily independent
in this case. The component $X_h$ is obtained by flattening out the 28 x 28 pixel
image, by a concatenation of rows, while $X_v$ is obtained similarly, by a
concatenation of columns. The mutual information between the horizontal and
vertical components is given by

$$I(X_h, X_v) = H(X_h) - H(X_h|X_v) = H(X_v) - H(X_v|X_h).$$

We note that in the case of reading a document (either in English or Chinese)
we have $I(X_h, X_v) = 0$, due to the independence of the components. This no
longer holds true for the case of a 2-dimensional picture.

The total information of an image is given by the joint entropy of its
components, $H(X_h, X_v)$. When the image is flattened into a vector, the
information is extracted only from the horizontal component, and this is $H(X_h)$.
The lost information is given by the difference between the total information,
$H(X_h, X_v)$, and the partial information extracted from the flattened vector,
$H(X_h)$, as in the following:

$$\mathcal{L} = H(X_h, X_v) - H(X_h). \tag{12.11.35}$$

Using that $H(X_h) + H(X_v) - H(X_h, X_v) = I(X_h, X_v)$, see equation (12.7.12),
the lost information becomes

$$\begin{aligned}
\mathcal{L} &= H(X_h, X_v) - H(X_h) = H(X_v) - I(X_h, X_v) \\
&= H(X_v) - \big( H(X_v) - H(X_v|X_h) \big) \\
&= H(X_v|X_h).
\end{aligned}$$

Hence, the lost information is the uncertainty of the vertical component, $X_v$,
conditioned by the horizontal component, and this is $H(X_v|X_h)$.

Remark 12.11.1 In general, the lost information depends on whether the 
vertical or the horizontal component entropy is subtracted from the total 
information H(Xh, X v ). However, if the vertical and horizontal components 
have the same entropy, the information loss is the same in both cases. 
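
The identity between the loss (12.11.35) and the conditional entropy $H(X_v|X_h)$ can be verified numerically. The sketch below uses a made-up joint distribution on a $2 \times 2$ alphabet; the distribution and the variable names are assumptions for illustration and have nothing to do with actual MNIST statistics.

```python
import math

# Hypothetical joint distribution p(x_h, x_v) on a 2 x 2 alphabet.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Marginal distribution of the horizontal component X_h.
p_h = {}
for (h, _), p in joint.items():
    p_h[h] = p_h.get(h, 0.0) + p

loss = entropy(joint) - entropy(p_h)            # definition (12.11.35)
# Conditional entropy computed directly: H(X_v | X_h) = -sum p(h, v) log2 p(v | h).
h_v_given_h = -sum(p * math.log2(p / p_h[h])
                   for (h, _), p in joint.items() if p > 0)
print(round(loss, 4), round(h_v_given_h, 4))    # the two values coincide
```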


12.12 Summary 

This chapter provides some tools for assessing numerically the information
contained in the layers of a neural net. Each layer output can be seen as a
signal whose entropy can be evaluated. Necessary conditions for the entropy
flow to be decreasing are provided.

Mutual information is used to describe quantitatively the information
conveyed by one layer about another layer. The invariance property and the data
processing inequalities regarding the mutual information are proved and used
in applications to DNNs.

The ability of a network to fit a large variety of target data is called
capacity. It depends on the size of layers, depth, and number of neurons.
It also represents the maximal information produced by a network under
variable input information. A large capacity may lead to an overfit, so a
technique that retains from the input only that part of information which is
relevant to the target variable is developed.

The information bottleneck is a method that compresses the input information
as much as possible such that the output contains enough meaningful
information about the target set.

Examples of numerical computation of information measures in the case
of the MNIST data are provided.


12.13 Exercises 

Exercise 12.13.1 Consider a continuous random variable $X$ with the density
function $p(x) = \frac{1}{x \ln^2 x}\, 1_{[e,\infty)}(x)$. Show that $H(X)$ is infinite.

Exercise 12.13.2 Let $X$ be a continuous random variable with density $p(x)$.
Show that:

(a) If $\mathrm{Var}(X) < \infty$, then $H(X) < \infty$;

(b) If $p(x) \le M < \infty$, then $H(X) \ge -\ln M$.

Exercise 12.13.3 (a) Define the mutual information of $X$ and $Y$, given $Z$, as

$$I(X, Y|Z) = H(X|Z) + H(Y|Z) - H(X, Y|Z).$$

Show that $I(X, Y|Z) = D_{KL}\big[ p(x, y, z)\, \|\, p(x|z)\, p(y|z)\, p(z) \big]$.

(b) Show that for any three random variables $X$, $Y$, and $Z$, we have:

$$H(X|Z) + H(Y|Z) \ge H(X, Y|Z).$$

When is the identity satisfied?

Exercise 12.13.4 Let $b \in \mathbb{R}^n$ and $A$ be an $n \times m$ matrix with rank $m$, where
$m < n$. Show that the linear system $AX = b$ has at most one solution $X \in \mathbb{R}^m$.

The next few exercises refer to section 12.9.

Exercise 12.13.5 Find the input distribution $p(x)$ in terms of the matrix $Q$
in the case $n = m$.

Exercise 12.13.6 Under the assumptions and notations of section 12.9, show
that the capacity satisfies the following inequality:

$$C(W, b) \ge \max_j \big( (Q^T Q)^{-1} Q^T h \big)_j.$$

Exercise 12.13.7 Consider a neural network obtained by the concatenation
of two perceptrons. The output of the network is given by the random variable

$$Y = H\big( w_2 H(w_1 X + b_1) + b_2 \big),$$

with $X \in \{0, 1\}$. What is the capacity of this network?

Exercise 12.13.8 (a) Show that the number of parameters of the neural
manifold associated with the two-layer FNN described in section 12.11.1 is
7,850.

(b) Show that the number of parameters of the neural manifold associated
with the three-layer FNN described in section 12.11.2 is 79,510.

(c) Which of the previous networks has a larger capacity and why?

Exercise 12.13.9 How does the capacity of a network change when:

(a) An extra fully-connected layer is added to the network;

(b) Some neurons are dropped out of the network;

(c) The weights are constrained to be kept small.

The next few exercises refer to section 12.11.

Exercise 12.13.10 Assume the ranks of matrices $\hat{W}$ and $\tilde{W}$ are maximal.
(a) Show that

$$\mathfrak{S}(X) = \mathfrak{S}(\tilde{Y}, U_{11}, \dots, U_{100}, X_{101}, \dots, X_{784}).$$

(b) Verify the relation

$$\mathfrak{S}(Y, X_{11}, \dots, X_{100}, X_{101}, \dots, X_{784}) = \mathfrak{S}(\tilde{Y}, U_{11}, \dots, U_{100}, X_{101}, \dots, X_{784}).$$

Exercise 12.13.11 With the notations of section 12.11.2 show that
$H(X|U) = 2.9708$, $I(X, U) = 6.643$ and verify the inequality

$$H(X) \ge I(X, U) \ge I(X, Y).$$

Exercise 12.13.12 Show that we have $H(X_h|X_v) = H(X_v|X_h)$ if and only
if $H(X_h) = H(X_v)$.

Exercise 12.13.13 Consider the information loss, $\mathcal{L}$, given by (12.11.35).
(a) Find $\mathcal{L}$ given that the vertical and horizontal components are
independent.

(b) Assume the vertical component depends deterministically on the
horizontal component. Find $\mathcal{L}$.

An image transformation is represented as an invertible function
$(X'_h, X'_v) = \Phi(X_h, X_v)$. This means that pixels change their coordinates,
while keeping constant their total number. The question is how does a pixel
shuffling change the entropy of an image? Since the question is too general
to have an exact answer, we shall restrict the question in the next exercise
to transformations of some particular type.

Exercise 12.13.14 Consider transformations of MNIST data of type

$$(X'_h, X'_v) = (\phi_1(X_h), \phi_2(X_v)),$$

with $\phi_i$ invertible. Show that these transformations preserve the total entropy
if and only if $H(X'_h) + H(X'_v) = H(X_h) + H(X_v)$.

Exercise 12.13.15 For any $p \in (0, 1)$ consider the binary entropy function

$$H(p) = -p \ln p - (1 - p) \ln(1 - p).$$

(a) Show that $H(p)$ is the entropy associated with a Bernoulli random variable.

(b) Verify the following relation between the derivative of the binary entropy
and the logit function:

$$\frac{dH(p)}{dp} = -\ln\Big( \frac{p}{1 - p} \Big).$$




Exercise 12.13.16 (decomposition into hierarchical levels) The mutual
information of $n$ random variables is defined as

$$I(X_1, \dots, X_n) = \sum_{k=1}^{n} H(X_k) - H(X_1, \dots, X_n).$$

(a) Show that $I(X, Y, Z) = I(X, Y) + I((X, Y), Z)$;

(b) State and prove a general statement.


Exercise 12.13.17 Consider the symmetric entropy of two discrete random
variables $X$ and $Y$ to be given by $d(X, Y) = H(X|Y) + H(Y|X)$, and define

$$D(X, Y) = \frac{d(X, Y)}{H(X, Y)}.$$

(a) Show that $d(X, Y) = H(X, Y) - I(X, Y)$;

(b) Show that $d$ is a distance function, i.e., it is nonnegative, symmetric, and
satisfies the triangle inequality

$$d(X, Y) + d(Y, Z) \ge d(X, Z), \qquad \forall X, Y, Z;$$

(c) Verify that $D(X, Y) = 1 - \frac{I(X, Y)}{H(X, Y)}$;

(d) Show that $D$ is a distance function with $D(X, X) = 0$ and $D(X, Y) \le 1$
for any pair $(X, Y)$.

Exercise 12.13.18 (Translation invariance and scaling equivariance)

Let $X$ be a random vector of dimension $n$, $A$ be an invertible $n \times n$ matrix,
and $b$ be a constant vector.

(a) Show that $H(AX + b) = H(X) + \ln |\det A|$;

(b) What is the application to the linear neuron?


Exercise 12.13.19 Consider a matrix $W = (w_{ij}) \in \mathcal{M}_{n \times n}$ such that
$|w_{ij}| < c$, with $c < \frac{1}{\sqrt[n]{n!}}$. Show that $\det W < 1$.

Exercise 12.13.20 Let $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ be two normally
distributed random variables, with Pearson correlation
$\rho = \frac{\mathrm{Cov}(X_1, X_2)}{\sigma_1 \sigma_2}$. Show that

$$I(X_1, X_2) = -\frac{1}{2} \ln(1 - \rho^2)$$

and provide an interpretation.



Exercise 12.13.21 If $X$ is a random variable with the cumulative distribution
function $F_X$, then it can be shown that $U = F_X(X)$ is a random variable
uniformly distributed on $[0, 1]$. Let $F_{X_1 X_2}$ denote the joint distribution
function of $(X_1, X_2)$. The copula of $(X_1, X_2)$ is defined as the distribution
function of $(U_1, U_2)$,

$$C(u_1, u_2) = P(U_1 \le u_1, U_2 \le u_2),$$

with $U_i = F_{X_i}(X_i)$. The density of the copula is given by
$c(u_1, u_2) = \frac{\partial^2 C(u_1, u_2)}{\partial u_1 \partial u_2}$.

(a) Prove the following relation between copula and mutual information:

$$I(X_1, X_2) = \int_0^1 \int_0^1 c(u_1, u_2) \ln c(u_1, u_2)\, du_1\, du_2.$$

This formula resembles the joint entropy formula, with the probability density
replaced by the copula density.

(b) Use part (a) to show that $I(X_1, X_2) = 0$ for $X_1$ and $X_2$ independent.


Exercise 12.13.22 One preprocessing technique used in neural networks is
the normalization of the input data, $X$. This means transforming it into a zero
mean random variable with unit standard deviation by the transformation

$$\tilde{X} = \frac{X - E[X]}{\sqrt{\mathrm{Var}(X)}}.$$

What is in this case the relationship between $H(\tilde{X})$ and $H(X)$?



Part IV 
Geometric Theory 




Chapter 13 

Output Manifolds 


In this chapter we shall associate a manifold with each neural network by
considering the weights and biases of a neural network as the coordinate system
on the manifold. This manifold can be endowed with a Riemannian metric,
which describes the intrinsic geometry of the network. Viewing a neural
network in this geometric framework is useful from the following points of view.

(i) The optimal weights and biases of a network correspond to the coordinates
of the orthonormal projection of the target onto the manifold.

(ii) Each learning algorithm involves a change in the parameter values with
respect to time and corresponds to a curve on this manifold. The most
efficient learning process corresponds to the shortest curve, or geodesic, between
the initial point and the projection point of the target.

(iii) The regularization problem can be treated in terms of the mean curvature
and second fundamental form of the manifold into the ambient target
space. Namely, since the flattest manifold produces the least overfitting to
training data, regularization can be viewed as finding the output manifold
with the smallest curvature.

In the next section we shall present an overview of the concept of manifold
and present some results useful for deep learning. The reader interested in
more introductory details on Differential Geometry is referred to [85].


13.1 Introduction to Manifolds 

A manifold is a geometrical space which resembles, at least locally,
the numerical space $\mathbb{R}^n$. Each point in the manifold is described by a set
of $n$ parameters, which are considered as local coordinates. The number of
parameters, $n$, is the dimension of the manifold. The manifold retains its own
identity regardless of the parametrization. Sometimes, a manifold cannot be
defined by only one parametrization and several parametrizations are needed
to cover the entire space.

© Springer Nature Switzerland AG 2020
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3_13

Figure 13.1: The transition functions $\phi_i \circ \phi_j^{-1}$ of a manifold are differentiable.

The manifold is called differentiable if the parametrizations are
differentiable, i.e., the parameters are assigned to points in the manifold in a
differentiable way. In the case of several parametrizations, $\phi_i : U_i \to \mathcal{M}$, their
transition functions, $\phi_i \circ \phi_j^{-1}$, have to be differentiable, see Fig. 13.1. Besides
this, a differentiable manifold is also required to satisfy a regularity
condition for all parametrizations: the Jacobian of the transition functions
$\phi_i \circ \phi_j^{-1}$ has maximum rank. This condition removes any cusps, corners, or
vertices on the manifold.

We note that for the scope of this book, where we need to model neural
networks, we shall not use manifolds with more than one parametrization.
Therefore, we do not go into too much detail about manifolds with multiple
parametrizations. In the following we shall present a few examples of manifolds.

Example 13.1.1 (Manifold of circles) The set of circles in the plane,
$\mathcal{C}$, can be organized as a manifold using three parameters: the center
coordinates, $(a, b)$, and the circle radius, $r$. The parameters space is
$U = \mathbb{R}^2 \times (0, \infty)$ and the manifold parametrization is $\phi : U \to \mathcal{C}$, where
$\phi(a, b, r)$ is the circle centered at $(a, b)$ and having radius $r$. In this case
the manifold $\mathcal{C}$ is parametrized by only one map, $\phi$. Each element of the
manifold is a circle and the manifold has dimension 3.

Example 13.1.2 (Manifold of lines) Let $P$ be a given point in the plane,
with coordinates $(x_0, y_0)$, satisfying $x_0 \neq 0$ and $y_0 \neq 0$. The family of lines in
the plane, $\mathcal{L}$, passing through the point $P$, can be considered as a manifold in
the following way. Let $\alpha$ and $\beta$ denote, respectively, the $x$-intercept and the
$y$-intercept of the lines in the previous family, see Fig. 13.2 a. Then
$\phi : \mathbb{R} \to \mathcal{L}$, where $\phi(\alpha)$ is the line passing through $P$ and $(\alpha, 0)$, is a
parametrization for all non-horizontal lines in $\mathcal{L}$. Similarly, $\psi : \mathbb{R} \to \mathcal{L}$,
where $\psi(\beta)$ is the line passing through $P$ and $(0, \beta)$, is a parametrization for
all non-vertical lines in $\mathcal{L}$. In this case the manifold $\mathcal{L}$ is one-dimensional but
it is defined by two parametrizations, $\phi$ and $\psi$, since neither of them can
cover the entire manifold. The transition function
$\psi \circ \phi^{-1} : \mathbb{R} \setminus \{x_0\} \to \mathbb{R} \setminus \{y_0\}$ is given by

$$\psi \circ \phi^{-1}(\alpha) = \beta = \frac{\alpha y_0}{\alpha - x_0}$$

and is bijective and differentiable. Thus, the manifold $\mathcal{L}$ becomes a
differentiable manifold.
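As a quick numerical sanity check of the transition function in Example 13.1.2, the sketch below computes the $y$-intercept of the line through $P$ and $(a, 0)$ directly and compares it with the closed form $a y_0 / (a - x_0)$. The helper names (`line_through`, `y_intercept`, `transition`) and the sample point $P = (2, 3)$ are illustrative assumptions, not part of the text.

```python
def line_through(p, q):
    """Return (A, B, C) with A*x + B*y = C for the line through the points p and q."""
    (x1, y1), (x2, y2) = p, q
    A, B = y2 - y1, x1 - x2
    return A, B, A * x1 + B * y1


def y_intercept(line):
    """y-coordinate where the line meets the y-axis (the line must not be vertical)."""
    A, B, C = line
    return C / B


def transition(a, x0, y0):
    """Closed form of the change of parameter a -> beta from Example 13.1.2."""
    return a * y0 / (a - x0)


x0, y0 = 2.0, 3.0                        # the fixed point P (illustrative values)
for a in (-4.0, -1.0, 0.5, 1.0, 5.0):    # x-intercepts, all different from x0
    beta = y_intercept(line_through((x0, y0), (a, 0.0)))
    assert abs(beta - transition(a, x0, y0)) < 1e-12
```

The value $a = x_0$ is excluded because the corresponding line is vertical and has no $y$-intercept, which is why two parametrizations are needed.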

Example 13.1.3 (Surface manifold) The unit upper half-sphere

$$\mathbb{S}_+ = \{(u_1, u_2, z);\ (u_1, u_2) \in B(0,1),\ z = (1 - u_1^2 - u_2^2)^{1/2}\}$$

is a manifold of dimension 2. Its parametrization is given by
$\phi : B(0,1) \to \mathbb{R}^3$, with $\phi(u_1, u_2) = (u_1, u_2, (1 - u_1^2 - u_2^2)^{1/2})$, see Fig. 13.2 b.
The parameters space is the open disk $B(0,1)$. It is worth noting that the
entire unit sphere is also a 2-dimensional manifold, but in this case (at least)
two parametrizations are needed (for instance, the stereographic projections
from the poles to the planes tangent to the sphere at the opposite pole).

If the coordinate $z$ can be written as $z = f(u_1, u_2)$, with $f : \mathcal{U} \to \mathbb{R}$
differentiable and $\mathcal{U}$ an open set in $\mathbb{R}^2$, then the manifold is a surface, called a
Monge patch. The surface is parametrized by two real numbers, $u_1$ and $u_2$.

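Example 13.1.4 (Manifold of matrices) The set of $2 \times 2$ matrices with
real entries, $\mathcal{M}_{2,2}(\mathbb{R})$, forms a 4-dimensional manifold. The parametrization
is given by $\phi : \mathbb{R}^4 \to \mathcal{M}_{2,2}(\mathbb{R})$,

$$\phi(a, b, c, d) = \begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$

The set of $2 \times 2$ diagonal matrices with real entries, $\mathcal{D}_{2,2}(\mathbb{R})$, forms a
2-dimensional manifold. The parametrization is $\psi : \mathbb{R}^2 \to \mathcal{D}_{2,2}(\mathbb{R})$,

$$\psi(a, d) = \begin{pmatrix} a & 0 \\ 0 & d \end{pmatrix}.$$

The matrix is parametrized by only two real numbers, $a$ and $d$. In fact,
$\mathcal{D}_{2,2}(\mathbb{R})$ is a submanifold of $\mathcal{M}_{2,2}(\mathbb{R})$, as a subset which inherits the ambient
manifold structure (the coordinates).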







Figure 13.2: a. The manifold of lines passing through the point $P(x_0, y_0)$ and
their axes intercepts. b. The upper half-sphere.


Example 13.1.5 (Manifold of densities) Consider the set of all
one-dimensional Gaussian probability densities

$$\mathcal{G} = \{p_{\mu,\sigma};\ \mu \in \mathbb{R},\ \sigma > 0\},$$

where $p_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $x \in \mathbb{R}$. The set $\mathcal{G}$ becomes a 2-dimensional
manifold parametrized by $\mu$ and $\sigma$.


The next example is of special significance for the subject of this book.


Example 13.1.6 (Manifold of sigmoid neurons) Consider a sigmoid neuron
with an $n$-dimensional input $x \in \mathbb{R}^n$ and the one-dimensional output
$y = \sigma(w^T x + b)$, where $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are the weights and the bias of the
neuron. We take $\sigma$ to be the logistic function. Then the set of outputs

$$\mathcal{S} = \{\sigma(w^T x + b);\ w \in \mathbb{R}^n,\ b \in \mathbb{R}\}$$

can be regarded as an $(n+1)$-dimensional manifold, parametrized by $w$ and $b$.
In the following we shall verify the regularity condition by showing that
the columns of the Jacobian matrix of $y = y(w, b)$ are linearly independent.
Using the properties of the logistic function, we have

$$\frac{\partial y}{\partial b} = \sigma'(w^T x + b) = y(1 - y)$$

$$\frac{\partial y}{\partial w_j} = \sigma'(w^T x + b)\, x_j = y(1 - y)\, x_j,$$

where $x^T = (x_1, \dots, x_n)$ and $w^T = (w_1, \dots, w_n)$. Consider the vanishing
linear combination of output functions

$$\alpha_0 \frac{\partial y}{\partial b} + \sum_{k=1}^n \alpha_k \frac{\partial y}{\partial w_k} = 0, \qquad \alpha_k \in \mathbb{R}.$$

Since $y(1 - y) \neq 0$, the previous relation becomes

$$\alpha_0 + \sum_{j=1}^n \alpha_j x_j = 0.$$

Since this relation holds for any $x_j \in \mathbb{R}$, it follows that $\alpha_0 = \alpha_1 = \cdots = \alpha_n = 0$.
(To show this, we choose all $x_j = 0$ to get $\alpha_0 = 0$; then take $x_j = \delta_{jk}$ to
obtain $\alpha_k = 0$.) Therefore

$$\Big\{ \frac{\partial y}{\partial b}, \frac{\partial y}{\partial w_1}, \dots, \frac{\partial y}{\partial w_n} \Big\}$$

are linearly independent. This implies that the Jacobian matrix, $J_y$, has rank
$n+1$. We may picture the differentiable manifold $\mathcal{S}$ as an $(n+1)$-dimensional
smooth surface (no corners or cusps) in the infinite-dimensional space of
functions on $\mathbb{R}^n$.
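The rank claim of Example 13.1.6 can be probed numerically. The following is only a sketch under stated assumptions: the functions $\partial y/\partial b$ and $\partial y/\partial w_j$ are sampled at a handful of randomly chosen inputs $x$, and the rank of the resulting matrix is computed by a small Gaussian elimination written for this purpose. The names `jacobian_columns` and `rank`, the random seed, and the sample sizes are hypothetical choices.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def jacobian_columns(w, b, xs):
    """Sample dy/db and dy/dw_j at the inputs xs: one row per input, n + 1 columns."""
    rows = []
    for x in xs:
        y = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        s = y * (1.0 - y)                       # sigma'(w^T x + b) = y(1 - y)
        rows.append([s] + [s * xj for xj in x])
    return rows


def rank(matrix, tol=1e-10):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        if r == len(m):
            break
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue                            # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * c for a, c in zip(m[i], m[r])]
        r += 1
    return r


random.seed(0)
n = 3
w = [random.uniform(-1, 1) for _ in range(n)]
b = 0.2
xs = [[random.uniform(-2, 2) for _ in range(n)] for _ in range(10)]
J = jacobian_columns(w, b, xs)
print(rank(J), "columns are independent out of", n + 1)
```

For generic sample inputs the sampled matrix should have full column rank $n + 1$, in agreement with the argument above; a finite sample can of course only illustrate, not prove, the linear independence of the functions.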

Assume now that the neuron is trained to approximate the continuous
target function $z = z(x)$. If $z$ is a point on the manifold, $z \in \mathcal{S}$, then there
are some parameter values, $w^* \in \mathbb{R}^n$ and $b^* \in \mathbb{R}$, such that we have the
exact representation $y(w^*, b^*) = z$. However, in general most target functions
satisfy the condition $z \notin \mathcal{S}$. In this case, we need to find by training the values

$$(w^*, b^*) = \arg\min_{w,b} \operatorname{dist}(z, \mathcal{S}),$$

which correspond to the coordinates of the orthogonal projection of $z$ on the
surface $\mathcal{S}$. The distance is measured, for instance, in the mean square sense.
Starting from the initialization $(w_0, b_0)$, a learning algorithm should produce
a sequence of approximations, $(w_n, b_n)_n$, which converges to the projection
coordinates, $\lim_{n \to \infty} (w_n, b_n) = (w^*, b^*)$. If the parameters update is made
continuously (implied by an infinitesimal learning rate), then we obtain a
curve $c(t) = (w(t), b(t))$ joining $(w_0, b_0)$ and $(w^*, b^*)$. This can be lifted to
the curve $y(t) = y \circ c(t)$ on the manifold $\mathcal{S}$. The fastest learning algorithm
corresponds to the "shortest" curve between $y(w_0, b_0)$ and $y(w^*, b^*)$. The
attribute "shortest" depends on the intrinsic geometry of the manifold $\mathcal{S}$,
and this topic will be discussed in the next section.

This chapter will extend the manifold ideas from this example to the
general case of a neural network.
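The learning curve $c(t) = (w(t), b(t))$ described in Example 13.1.6 can be illustrated with a discrete sketch. The code below assumes a one-dimensional input, plain gradient descent on the mean square error, and a target that lies on $\mathcal{S}$, namely $z = \sigma(2x - 1)$, so that $(w^*, b^*) = (2, -1)$. The learning rate, the number of steps, the sample grid, and the function names are illustrative assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mse(w, b, xs, zs):
    """Mean square distance between the neuron output and the target on the samples."""
    return sum((sigmoid(w * x + b) - z) ** 2 for x, z in zip(xs, zs)) / len(xs)


def gradient_step(w, b, xs, zs, lr):
    """One gradient-descent update of (w, b) for the mean square error."""
    gw = gb = 0.0
    for x, z in zip(xs, zs):
        y = sigmoid(w * x + b)
        common = 2.0 * (y - z) * y * (1.0 - y) / len(xs)
        gw += common * x
        gb += common
    return w - lr * gw, b - lr * gb


def learning_curve(w0, b0, xs, zs, lr=0.5, steps=2000):
    """Return the discrete curve c_k = (w_k, b_k) produced by gradient descent."""
    curve = [(w0, b0)]
    for _ in range(steps):
        w, b = curve[-1]
        curve.append(gradient_step(w, b, xs, zs, lr))
    return curve


xs = [i / 10.0 for i in range(-20, 21)]
zs = [sigmoid(2.0 * x - 1.0) for x in xs]    # target on S, so (w*, b*) = (2, -1)
curve = learning_curve(0.0, 0.0, xs, zs)
print("start:", curve[0], "end:", curve[-1])
```

Each point of `curve` is a point of the parameter space; applying $y$ to it gives the lifted curve on $\mathcal{S}$. Gradient descent is used here only as one possible learning algorithm and its path is in general not the shortest one.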






Figure 13.3: The tangent space $T_p\mathcal{M}$.


13.1.1 Intrinsic and Extrinsic 

A manifold can be viewed from two distinct perspectives: intrinsic and
extrinsic.

The intrinsic point of view is the perspective of a local observer living
on the manifold. The observer's knowledge is bound to a local system of
coordinates. Thus, the distance measured on the manifold, tangent vectors,
their magnitudes, and angles between them belong to the intrinsic geometry
of the manifold. One can picture this point of view as the perspective of
an ant, which lives on the manifold, but does not have any access to the
information outside of it.

The extrinsic perspective represents the knowledge about the manifold
acquired by an observer who looks at the manifold from the exterior.
Geometric concepts, such as normal vector and manifold shape, are extrinsic.
This can be pictured as the perspective of a satellite, which can observe the
manifold from outside.

In the following we shall present a few intrinsic and extrinsic concepts of
differential geometry on differentiable manifolds.


13.1.2 Tangent space 

If the manifold $\mathcal{M}$ has a local parametrization $\phi : \mathcal{U} \to \mathcal{M}$, with $\mathcal{U} \subset \mathbb{R}^n$
an open set, then the vector tangent to the $k$th coordinate curve

$$h \mapsto \phi(x_1, \dots, x_k + h, \dots, x_n)$$

is given by

$$T_k(p) = \frac{\partial \phi}{\partial x_k}(x), \qquad p = \phi(x).$$

If the manifold satisfies the regularity condition that $\phi$ has a Jacobian matrix
of maximum rank at $p$, then the tangent vectors $\{T_1(p), \dots, T_n(p)\}$ are linearly
independent. They form the basis of a linear space of dimension $n$, called the
tangent space to $\mathcal{M}$ at $p$, denoted by $T_p\mathcal{M}$, see Fig. 13.3.

If $c(t)$ is a curve on the manifold $\mathcal{M}$, then in a local parametrization the
velocity along $c(t)$ is a tangent vector to the manifold, which is given by

$$\dot{c}(t) = \sum_{k=1}^n \dot{c}^k(t)\, T_k(c(t)).$$

A tangent vector field, $U$, is a smooth assignment of a tangent vector
$U_p \in T_p\mathcal{M}$ at each point $p \in \mathcal{M}$. In local coordinates this writes as
$U_p = \sum_{k=1}^n U^k(p)\, T_k(p)$.


13.1.3 Riemannian metric 


The main intrinsic concept on a manifold is the Riemannian metric. This
is given in a local system of coordinates by a symmetric, positive definite,
nondegenerate matrix, $(g_{ij})_{i,j}$, which is point dependent. Its entries represent
weights used in measuring the distance between neighboring points on the
manifold by a procedure similar to the Pythagorean theorem. Thus, if $P$ and
$P'$ are two neighboring points on the manifold, having the coordinates $(x_i)$
and $(x'_i)$, the distance between them is measured in terms of the Riemannian
metric as

$$d(P, P') = \Big( \sum_{i,j=1}^n g_{ij} (\Delta x)_i (\Delta x)_j \Big)^{1/2},$$

where $(\Delta x)_i = x_i - x'_i$, and $g_{ij} = g_{ij}(P)$. In the particular case of a Euclidean
space, $\mathbb{R}^n$, when the coordinates are orthonormal, the metric matrix becomes
the identity matrix, $g_{ij} = \delta_{ij}$. The distance is now measured by the usual
Pythagorean theorem,

$$d(P, P') = \Big( \sum_{i=1}^n (\Delta x)_i^2 \Big)^{1/2}.$$

By a similar procedure we can measure the magnitude of a vector $v$
tangent to the manifold as

$$\|v\|_g = \Big( \sum_{i,j=1}^n g_{ij} v^i v^j \Big)^{1/2},$$

where $v^j$ denotes the $j$th component of the vector in a local system of
coordinates. If $g$ is the bilinear form defined on a basis $\{T_1, \dots, T_n\}$ in $T_p\mathcal{M}$ by
$g(T_i, T_j) = g_{ij}$, then the previous formula can be written as

$$\|v\|_g = g(v, v)^{1/2}.$$












The coefficients $g_{ij}$ are historically referred to as the coefficients of the first
fundamental form $g$.

On a Euclidean space this becomes the following familiar formula for the
length of a vector:

$$\|v\| = \Big( \sum_{i,j=1}^n \delta_{ij} v^i v^j \Big)^{1/2} = \Big( \sum_{j=1}^n (v^j)^2 \Big)^{1/2}.$$

If now $c(t)$ denotes a curve lying on the manifold and parametrized by the
variable $t$, its velocity vector, $\dot{c}(t)$, is a tangent vector to the manifold, whose
magnitude is given by

$$\|\dot{c}(t)\|_g = \Big( \sum_{i,j=1}^n g_{ij}(c(t))\, \dot{c}^i(t)\, \dot{c}^j(t) \Big)^{1/2} = g(\dot{c}(t), \dot{c}(t))^{1/2}.$$

Integrating the speed, $\|\dot{c}(t)\|_g$, with respect to time, $t$, we obtain the length of
the curve measured with respect to the Riemannian metric

$$L(c) = \int_a^b \|\dot{c}(t)\|_g \, dt,$$

where $a \le t \le b$. The Riemannian distance between two points $A$ and $B$ on
the manifold $\mathcal{M}$ is defined as the length of the shortest curve joining these
points

$$d(A, B) = \inf_c \{ L(c);\ c : [a, b] \to \mathcal{M},\ c(a) = A,\ c(b) = B \}.$$

The pair $(\mathcal{M}, g)$ is called a Riemannian manifold. This chapter will describe
the neural networks using Riemannian manifolds.
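To see the length formula $L(c) = \int_a^b \|\dot{c}(t)\|_g \, dt$ at work, the sketch below integrates the speed numerically with the midpoint rule. As a test case it assumes the unit sphere in coordinates $(\theta, \varphi)$ with metric $\mathrm{diag}(1, \sin^2\theta)$, which is a standard example not taken from the text, and measures the circle of constant $\theta_0$, whose length is $2\pi \sin\theta_0$. All function names are hypothetical.

```python
import math


def sphere_metric(theta, phi):
    """Metric of the unit sphere in coordinates (theta, phi): diag(1, sin^2 theta)."""
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]


def curve_length(curve, velocity, metric, a, b, steps=10_000):
    """Midpoint-rule approximation of L(c), the integral of the speed ||c'(t)||_g."""
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        t = a + (k + 0.5) * h
        g = metric(*curve(t))
        v = velocity(t)
        speed_sq = sum(g[i][j] * v[i] * v[j] for i in range(2) for j in range(2))
        total += math.sqrt(speed_sq) * h
    return total


theta0 = math.pi / 3


def latitude(t):
    """Circle of constant theta = theta0, parametrized by phi = t."""
    return (theta0, t)


def latitude_dot(t):
    """Coordinate velocity of the circle above."""
    return (0.0, 1.0)


length = curve_length(latitude, latitude_dot, sphere_metric, 0.0, 2 * math.pi)
print(length, 2 * math.pi * math.sin(theta0))
```

The same routine works for any two-dimensional metric supplied as a function returning the matrix $(g_{ij})$ at a point.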


13.1.4 Geodesics 


Another intrinsic concept is the notion of geodesic. This is the shortest curve
on a manifold between two given points. If the points are close enough, there
is always a geodesic between them and this is unique. In local coordinates
the geodesic $c(t)$ can be described by a system of nonlinear equations as

$$\ddot{c}^k(t) + \sum_{i,j} \Gamma^k_{ij}\, \dot{c}^i(t)\, \dot{c}^j(t) = 0, \qquad 1 \le k \le n, \tag{13.1.1}$$

where $\Gamma^k_{ij} = \Gamma^k_{ij}(c(t))$ are the Christoffel symbols of second kind, see, for
instance, [85]

$$\Gamma^k_{ij} = \frac{1}{2} \sum_r g^{kr} \Big( \frac{\partial g_{ir}}{\partial x_j} + \frac{\partial g_{jr}}{\partial x_i} - \frac{\partial g_{ij}}{\partial x_r} \Big), \tag{13.1.2}$$

where $(x_1, \dots, x_n)$ represent the local coordinates on the manifold. Since they
depend on the metric coefficients $g_{ij}$, the Christoffel symbols are intrinsic, and
hence the geodesics are too. Two simple examples are given next.

1. In the Euclidean space $(\mathbb{R}^n, \delta_{ij})$ the derivatives of the metric coefficients
$\delta_{ij}$ are zero and the geodesics equations become $\ddot{c}^k(t) = 0$. This implies that
the geodesics are straight lines.

2. In the case of the 2-dimensional sphere $\mathbb{S}^2$ the geodesics are arcs of great
circles. The geodesics equations are too complicated to be given here, the
reader being referred to a book of differential geometry.

The explicit computation of geodesics can be done only on a few particular
types of manifolds and it is a relatively complex process.
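Formula (13.1.2) can be evaluated numerically when the metric is known. The sketch below approximates the derivatives of $g_{ij}$ by central differences and assembles $\Gamma^k_{ij}$ for a two-dimensional metric. As a test case it assumes the unit sphere with metric $\mathrm{diag}(1, \sin^2\theta)$ in coordinates $(\theta, \varphi)$, for which the standard values are $\Gamma^\theta_{\varphi\varphi} = -\sin\theta\cos\theta$ and $\Gamma^\varphi_{\theta\varphi} = \cot\theta$. The function names and the step size are illustrative assumptions.

```python
import math


def sphere_metric(x):
    """Unit-sphere metric at x = (theta, phi)."""
    theta, _ = x
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]


def inverse_2x2(g):
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det], [-g[1][0] / det, g[0][0] / det]]


def metric_derivative(metric, x, r, eps=1e-5):
    """Central difference of every g_ij with respect to the coordinate x_r."""
    xp, xm = list(x), list(x)
    xp[r] += eps
    xm[r] -= eps
    gp, gm = metric(xp), metric(xm)
    return [[(gp[i][j] - gm[i][j]) / (2 * eps) for j in range(2)] for i in range(2)]


def christoffel(metric, x):
    """Gamma[k][i][j] from formula (13.1.2), for a 2-dimensional metric."""
    g_inv = inverse_2x2(metric(x))
    # dg[r][i][j] approximates the derivative of g_ij with respect to x_r
    dg = [metric_derivative(metric, x, r) for r in range(2)]
    return [[[0.5 * sum(g_inv[k][r] * (dg[j][i][r] + dg[i][j][r] - dg[r][i][j])
                        for r in range(2))
              for j in range(2)]
             for i in range(2)]
            for k in range(2)]


theta = 1.0
gamma = christoffel(sphere_metric, [theta, 0.3])
print(gamma[0][1][1], -math.sin(theta) * math.cos(theta))
print(gamma[1][0][1], math.cos(theta) / math.sin(theta))
```

With these symbols in hand, the geodesic equations (13.1.1) become an ordinary differential system that can be passed to any numerical integrator.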

13.1.5 Levi-Civita Connection 

We shall start with the case of the Euclidean space. Let $f$ be a differentiable
function on $\mathbb{R}^n$ and $v$ a vector. The directional derivative of $f$ with respect
to $v$ is defined by $v(f) = \langle v, \nabla f \rangle$, where $(\nabla f)^T = (\partial_{x_1} f, \dots, \partial_{x_n} f)$ denotes
the gradient of $f$. The object $v(f)$ represents the rate of change of $f$ in the
direction of $v$.

Let $U = (U^1, \dots, U^n) = \sum_{i=1}^n U^i e_i$ be a vector field on $\mathbb{R}^n$. The derivative
of $U$ with respect to the vector $v$ is defined as

$$\nabla_v U = (v(U^1), \dots, v(U^n)) = \sum_{i=1}^n v(U^i)\, e_i. \tag{13.1.3}$$

Now, let $V = (V^1, \dots, V^n)$ be another vector field on $\mathbb{R}^n$. We define the
covariant derivative of $U$ with respect to $V$ as

$$(\nabla_V U)_p = \nabla_v U,$$

where $v = V_p$ and the term on the right side is defined as in (13.1.3). We note
that $\nabla_V U$ is a vector field on $\mathbb{R}^n$, which associates to each point $p \in \mathbb{R}^n$ the
vector $(\nabla_V U)_p$.

It is not hard to show that for any differentiable function $f$ on $\mathbb{R}^n$ and any
vector fields $U$, $V$, and $W$ on $\mathbb{R}^n$, the following properties hold:

(i) $\nabla_{fU} V = f \nabla_U V$,

(ii) $\nabla_U (V + W) = \nabla_U V + \nabla_U W$,

(iii) $\nabla_U (fV) = U(f) V + f \nabla_U V$.

The next property provides a compatibility between the derivation and
inner product and it is a consequence of the product rule:

(iv) $W \langle U, V \rangle = \langle \nabla_W U, V \rangle + \langle U, \nabla_W V \rangle$.

The noncommutativity of the covariant differentiation is given by the
formula

(v) $\nabla_U V - \nabla_V U = [U, V]$,

where $[U, V] = UV - VU$ is the commutator of the vector fields $U$ and $V$.
It can be shown by a direct computation that the commutator is always a
vector field with components given by

$$[U, V]^i = U(V^i) - V(U^i) = \sum_{j=1}^n \big( \partial_j(V^i)\, U^j - \partial_j(U^i)\, V^j \big).$$
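The component formula $[U, V]^i = U(V^i) - V(U^i)$ can be checked on a small example in the Euclidean plane. The sketch below assumes vector fields given as Python functions from a point to a list of components and approximates directional derivatives by central differences. The choice $U = e_1$, $V = x_1 e_2$, for which $[U, V] = e_2$, and the helper names are illustrative assumptions.

```python
def directional_derivative(v, f, p, eps=1e-6):
    """v(f) = <v, grad f> at the point p, with the gradient taken by central differences."""
    total = 0.0
    for i, vi in enumerate(v):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        total += vi * (f(hi) - f(lo)) / (2 * eps)
    return total


def commutator(U, V, p):
    """Components [U, V]^i = U(V^i) - V(U^i) at the point p."""
    n = len(p)
    return [directional_derivative(U(p), lambda q, i=i: V(q)[i], p)
            - directional_derivative(V(p), lambda q, i=i: U(q)[i], p)
            for i in range(n)]


def U(p):
    return [1.0, 0.0]          # U = e_1


def V(p):
    return [0.0, p[0]]         # V = x_1 e_2


bracket = commutator(U, V, [0.7, -1.2])
print(bracket)                 # close to [0, 1], up to rounding
```

In the Euclidean case the two terms are exactly the components of $\nabla_U V$ and $\nabla_V U$ from (13.1.3), so the same computation illustrates property (v).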

Consider now a Riemannian manifold $(\mathcal{M}, g)$. A linear connection, $\nabla$, on
the manifold $\mathcal{M}$ is an operator acting on vector fields of $\mathcal{M}$ satisfying the
aforementioned properties (i)-(iii). There are many linear connections on a
manifold. They are independent of the Riemannian structure $g$. However,
there is only one linear connection that also satisfies properties (iv) and (v).

Theorem 13.1.7 There is a unique linear connection on the Riemannian
manifold $(\mathcal{M}, g)$ such that

$$W g(U, V) = g(\nabla_W U, V) + g(U, \nabla_W V),$$

$$\nabla_U V - \nabla_V U = [U, V],$$

for all vector fields $U, V, W$ on $\mathcal{M}$.

This is called the Levi-Civita connection on $(\mathcal{M}, g)$. Since the metric $g$
determines uniquely this connection, the Levi-Civita connection is an intrinsic
concept. For a proof of the existence and uniqueness of this connection the
reader is referred, for instance, to Chapter 7 of the book [22].

In local coordinates, the dependence of the Levi-Civita connection on the
metric coefficients $g_{ij}$ is expressed through the Christoffel symbols
(13.1.2) as

$$\nabla_{T_i} T_j = \sum_k \Gamma^k_{ij} T_k,$$

where $\{T_1, \dots, T_n\}$ is a basis of the tangent space.

13.1.6 Submanifolds 

Let $\mathcal{M}$ and $\mathcal{S}$ be two manifolds, such that $\mathcal{S} \subset \mathcal{M}$ and $\mathcal{S}$ is endowed with
the induced topology and differentiability structure from $\mathcal{M}$. Then any
Riemannian metric $g$ on $\mathcal{M}$ induces a Riemannian structure on $\mathcal{S}$ as

$$h(U, V) = g_{|\mathcal{S}}(U, V),$$

where $U$, $V$ are vector fields on $\mathcal{S}$ and $h = g_{|\mathcal{S}}$ is the restriction of $g$ to vector
fields of $\mathcal{S}$. Then $(\mathcal{S}, h)$ becomes a Riemannian submanifold of $(\mathcal{M}, g)$.





Figure 13.4: Gauss formula $\nabla_U V = (\nabla_U V)^{\parallel} + L(U, V)$.


Let $n$ and $m$ be the dimensions of $\mathcal{M}$ and $\mathcal{S}$, respectively. Then for any
point $p \in \mathcal{S}$ the tangent space $T_p\mathcal{S}$ is a linear subspace of dimension $m$ of the
linear space $T_p\mathcal{M}$. We can consider the orthogonal split

$$T_p\mathcal{M} = T_p\mathcal{S} \oplus V_p, \tag{13.1.4}$$

where $V_p = \{v \in T_p\mathcal{M};\ g(v, u) = 0,\ \forall u \in T_p\mathcal{S}\}$. This means that for any vector
$w \in T_p\mathcal{M}$ there are two orthogonal vectors $u \in T_p\mathcal{S}$ and $v \in V_p$ such that
$w = u + v$.
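The orthogonal split (13.1.4) is easy to compute in a simple case. The sketch below assumes that the ambient manifold is the Euclidean space $\mathbb{R}^3$ and that the submanifold is the unit sphere, so that the normal space $V_p$ at a point $p$ is spanned by $p$ itself. The function names and the sample vectors are hypothetical.

```python
def dot(a, b):
    """Euclidean inner product, playing the role of the metric g."""
    return sum(x * y for x, y in zip(a, b))


def split(w, p):
    """Split w into u tangent to the unit sphere at p and v normal to it."""
    coeff = dot(w, p) / dot(p, p)        # component of w along the normal direction p
    v = [coeff * pi for pi in p]
    u = [wi - vi for wi, vi in zip(w, v)]
    return u, v


p = [2 / 3, 2 / 3, 1 / 3]                # a point on the unit sphere
w = [1.0, -2.0, 0.5]                     # an arbitrary vector of the ambient space
u, v = split(w, p)
print("tangent part:", u)
print("normal part:", v)
```

The tangent part satisfies $g(u, p) = 0$ and the two parts add up to $w$, which is the content of the decomposition $w = u + v$.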

13.1.7 Second Fundamental Form 

The shape of a submanifold with respect to the ambient manifold is described
by its second fundamental form. Let $\mathcal{S}$ be a submanifold of the Riemannian
manifold $(\mathcal{M}, g)$ and denote by $\nabla$ the Levi-Civita connection on $(\mathcal{M}, g)$. Then
for any two vector fields $U$ and $V$ on the submanifold $\mathcal{S}$ the vector field $\nabla_U V$
is not necessarily a vector field on $\mathcal{S}$. In general, we have according to (13.1.4)
the following orthogonal decomposition:

$$(\nabla_U V)_p = (\nabla_U V)_p^{\parallel} + (\nabla_U V)_p^{\perp} \tag{13.1.5}$$

with $(\nabla_U V)_p^{\parallel} \in T_p\mathcal{S}$ and $(\nabla_U V)_p^{\perp} \in V_p$. Relation (13.1.5) is called Gauss
formula, see Fig. 13.4.











Figure 13.5: The second fundamental form is normal to the surface and
describes its shape: a. The case of a sphere. b. The case of a cylindrical
surface.


It can be proved that the operator $\overline{\nabla} = \nabla^{\parallel}$ is the Levi-Civita connection
on the submanifold $(\mathcal{S}, g_{|\mathcal{S}})$.

The second fundamental form of the submanifold $\mathcal{S}$ with respect to $\mathcal{M}$ is
defined by

$$L(U, V) = (\nabla_U V)^{\perp} \tag{13.1.6}$$

for any $U$, $V$ vector fields on $\mathcal{S}$, see Fig. 13.5.

Proposition 13.1.8 Let $U, V, W$ be arbitrary vector fields on $\mathcal{S}$. The second
fundamental form satisfies the following properties:

(i) $L$ is symmetric: $L(U, V) = L(V, U)$;

(ii) $L$ is bilinear: $L(U + W, V) = L(U, V) + L(W, V)$;

(iii) $L(f_1 U, f_2 V) = f_1 f_2 L(U, V)$ for any two differentiable functions $f_1$ and
$f_2$ on $\mathcal{S}$.

Proof: (i) Using the properties of the Levi-Civita connections $\nabla$ and $\overline{\nabla}$ we
show that $L$ is symmetric

$$\begin{aligned}
L(U, V) - L(V, U) &= (\nabla_U V - \overline{\nabla}_U V) - (\nabla_V U - \overline{\nabla}_V U) \\
&= (\nabla_U V - \nabla_V U) - (\overline{\nabla}_U V - \overline{\nabla}_V U) \\
&= [U, V] - [U, V] \\
&= 0.
\end{aligned}$$



















The other properties can be easily verified and are left as an exercise to the
reader. ■

It is worth noting that property (iii) states that if the vector fields $U$ and
$V$ get scaled by the functions $f_1$ and $f_2$, then the second fundamental form
gets scaled by the product $f_1 f_2$.

13.1.8 Mean Curvature Vector Field 

Let $(\mathcal{S}, h)$ be a submanifold of the Riemannian manifold $(\mathcal{M}, g)$, with $h = g_{|\mathcal{S}}$,
and denote the second fundamental form by $L$. For any fixed point $p \in \mathcal{S}$,
we consider a vector basis $\{T_1, \dots, T_m\}$ in the tangent space $T_p\mathcal{S}$ and define
$L_{ij} = L(T_i, T_j)$, which is a normal vector to $\mathcal{S}$, for any $1 \le i, j \le m$, since
$L_{ij} \in V_p$. Let $h_{ij} = h(T_i, T_j)$ be the coefficients of the Riemannian metric on
$\mathcal{S}$ and $h^{ij}$ denote the coefficients of the inverse matrix $(h_{ij})^{-1}$.

The mean curvature vector at $p$ is defined by the following "contraction":

$$H_p = \sum_{i,j} h^{ij}(p)\, L_{ij}.$$

We have $H_p \in V_p$ and the mapping $p \mapsto H_p$ defines the mean curvature vector
field of $\mathcal{S}$, which is normal to $\mathcal{S}$ at each point.

The definition simplifies a little if we assume the initial chosen basis
orthonormal. If $\{E_1, \dots, E_m\}$ is an orthonormal basis in $T_p\mathcal{S}$, then
$h(E_i, E_j) = \delta_{ij}$ and the mean curvature vector at $p$ becomes

$$H_p = \sum_{i,j} \delta^{ij} L(E_i, E_j) = \sum_{i=1}^m L(E_i, E_i).$$

The mean curvature vector field is an extrinsic notion, since it depends on
the second fundamental form.

The second fundamental form and the mean curvature measure different types
of extrinsic curvatures of submanifolds. For instance, if the second
fundamental form is zero, $L = 0$, then the submanifold $\mathcal{S}$ is called totally geodesic.
This means that any geodesic of $\mathcal{S}$ is also a geodesic of $\mathcal{M}$. To understand
this concept the reader should picture the particular case of a plane included
into the 3-dimensional space: any straight line in the plane is also a straight
line in the space. This is compatible with the fact that the plane does not
bend in the space.

If the mean curvature vector field is zero, $H = 0$, then $\mathcal{S}$ is called a
minimal submanifold of $\mathcal{M}$. The geometric interpretation is that $\mathcal{S}$ has locally
a minimal volume; that is, if the manifold is perturbed locally, then its volume
increases. The concepts of second fundamental form and mean curvature will
be useful for regularization purposes later.
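To make the definition $H_p = \sum_i L(E_i, E_i)$ concrete, the sketch below estimates the mean curvature vector of a sphere of radius $R$ in the Euclidean space $\mathbb{R}^3$. It relies on an assumption not spelled out in the text: along a unit-speed geodesic of the sphere (a great circle) with velocity $E$, the ambient derivative $\nabla_E E$ is the Euclidean acceleration, and at the chosen point it is purely normal, so it equals $L(E, E)$. The acceleration is approximated by central differences, and all names are hypothetical.

```python
import math


def great_circle(p, e, radius):
    """Unit-speed great circle on the sphere of given radius, through p with velocity e."""
    def curve(s):
        return [pi * math.cos(s / radius) + radius * ei * math.sin(s / radius)
                for pi, ei in zip(p, e)]
    return curve


def acceleration(curve, s=0.0, h=1e-4):
    """Second derivative of the curve at s by central differences."""
    before, here, after = curve(s - h), curve(s), curve(s + h)
    return [(x - 2 * y + z) / h ** 2 for x, y, z in zip(before, here, after)]


def mean_curvature_vector(radius):
    """H_p = L(E_1, E_1) + L(E_2, E_2) at the north pole p = (0, 0, radius)."""
    p = [0.0, 0.0, radius]
    frame = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # orthonormal basis of the tangent plane
    H = [0.0, 0.0, 0.0]
    for e in frame:
        acc = acceleration(great_circle(p, e, radius))
        H = [hk + ak for hk, ak in zip(H, acc)]
    return H


print(mean_curvature_vector(2.0))   # approximately (0, 0, -1), i.e. -(2 / R) times the unit normal
```

The result points toward the center of the sphere and has magnitude close to $2/R$, so a larger sphere bends less inside the ambient space. For a plane the same computation would give zero, in agreement with the totally geodesic case.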





13.2 Relation to Neural Networks 

The reader might have wondered what the relation is between the differential
geometry concepts presented so far and neural networks. This section briefly
discusses this relation, while the later sections will present a more detailed
analysis.

The role of a given neural network is to approximate a certain target
function $z$. We assume that $z$ is an element of a target manifold, $\mathcal{M}$, such
as the manifold of continuous functions on $[0,1]$. The output $y$ of the neural
network is parametrized by $\theta = (w, b)$, the weights and biasses of the network.
This way, the output $y$ belongs to an output manifold, $\mathcal{S}$, which is supposed
to be a submanifold of the target manifold, $\mathcal{M}$. The dimension of the
submanifold $\mathcal{S}$ is equal to the number of network weights and biasses, while the
dimension of $\mathcal{M}$ in this case is infinite.$^1$ It is worth noting that for practical
applications the target is a vector $z^T = (z_1, \dots, z_n)$, a fact that implies that
the manifold $\mathcal{M}$ has dimension $n$.

The dimension of the output manifold can be increased by adding more
neurons, and hence more parameters. The larger the dimension, the better the
approximation. However, the shape of the submanifold $\mathcal{S}$ also plays a
determinant role in trying to avoid overfitting. The second fundamental form, $L$,
describes how the submanifold $\mathcal{S}$ bends inside $\mathcal{M}$. From the regularization
point of view, we prefer submanifolds that bend as little as possible,
so that the orthogonal projection of the target function $z$ onto $\mathcal{S}$ is eventually
unique and easy to find by the gradient descent method.

As an example, we shall consider the case of the manifold given by
Example 13.1.6. The target manifold in this case can be chosen to be the space
$\mathcal{M} = C[0,1]$. The manifold of outputs

$$\mathcal{S} = \{\sigma(w^T x + b);\ w \in \mathbb{R}^n,\ b \in \mathbb{R}\}$$

is an $(n+1)$-dimensional submanifold of $\mathcal{M}$. At each point $y \in \mathcal{S}$ the tangent
space $T_y\mathcal{S}$ is spanned by the functions

$$\{y(1 - y),\ y(1 - y)x_1,\ \dots,\ y(1 - y)x_n\}.$$


The submanifold $\mathcal{S}$ can be seen as an $(n+1)$-dimensional hypersurface
inside the space of real-valued continuous functions $\mathcal{M} = C([0,1])$. In the
case of one sigmoid neuron the target space $\mathcal{M}$ is not well approximated by
the surface $\mathcal{S}$, since there are continuous functions $f \in \mathcal{M}$ whose distance to
the surface $\mathcal{S}$ cannot be made arbitrarily small. Equivalently stated, using $\mathcal{S}$
to approximate $\mathcal{M}$ leads to an underfit.

$^1$ A system of parameters for a continuous function defined on $[0,1]$ is given by
its values on the set of rational numbers $\mathbb{Q} \cap [0,1]$.
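The underfit of a single sigmoid neuron can be observed on a simple target. The sketch below assumes a one-dimensional input, the non-monotone target $f(x) = 4x(1 - x)$ on $[0,1]$, and a coarse grid search over $(w, b)$; the distance between two functions is taken as the largest pointwise gap on a grid of inputs. Since every output $\sigma(wx + b)$ is monotone in $x$, while $f$ vanishes at both endpoints and equals 1 at the midpoint, the gap cannot drop below $1/2$. The names and grid ranges are illustrative assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sup_distance(f, s, grid):
    """Largest pointwise gap between f and s on the grid."""
    return max(abs(f(x) - s(x)) for x in grid)


def best_neuron_fit(f, grid, values):
    """Grid search for the smallest sup-distance between f and sigma(w x + b)."""
    best = (float("inf"), None, None)
    for w in values:
        for b in values:
            d = sup_distance(f, lambda x: sigmoid(w * x + b), grid)
            best = min(best, (d, w, b))
    return best


def target(x):
    return 4.0 * x * (1.0 - x)                  # continuous, not monotone on [0, 1]


grid = [i / 50.0 for i in range(51)]            # contains 0, 1/2 and 1
values = [k / 2.0 for k in range(-40, 41)]      # candidate weights and biases in [-20, 20]
distance, w_best, b_best = best_neuron_fit(target, grid, values)
print(distance, w_best, b_best)
```

No choice of $(w, b)$ on the grid brings the distance below $1/2$, which illustrates that this target stays at a positive distance from the output manifold of one neuron.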

However, increasing the number of neurons in the network leads to a larger
number of parameters, and hence to a higher dimensional approximation
manifold $\mathcal{S}$. The hope is that for any element of the target space, $f \in \mathcal{M}$,
and any fixed $\epsilon > 0$, there is a network that produces a manifold $\mathcal{S}$ of high
enough dimension such that $\operatorname{dist}(f, \mathcal{S}) < \epsilon$, where the distance is measured by

$$\operatorname{dist}(f, \mathcal{S}) = \inf_{s \in \mathcal{S}} \max_{x \in [0,1]} |f(x) - s(x)|.$$

The dimension of $\mathcal{S}$ can obviously be increased by adding more neurons until
the desired approximation holds. However, in practice the supply of neurons
might be limited, a fact that leads to the problem of maximizing the
dimension of $\mathcal{S}$ while keeping the number of neurons constant. We shall deal with
this problem in the next section.



13.3 The Parameters Space 

Assume we are supplied with a fixed number, $N$, of computing units and we
need to construct a feedforward neural network using these units as hidden
neurons, such that the network acquires its maximum capacity, i.e., it has
a maximum ability of approximating target functions. Specifically, we shall
look for the network architecture which uses only $N$ hidden neurons and
produces a maximum dimension for the output manifold $\mathcal{S}$. This is obtained
when the number of network parameters is maximized.

We shall start with an example. Assume we are provided with $N = 10$
hidden neurons and consider the following three feedforward network
architectures, ordered from shallow to deep:

(i) only one hidden layer with $N = 10$ hidden neurons in that layer;

(ii) 2 hidden layers with 5 hidden neurons in each layer;

(iii) 5 hidden layers with 2 hidden neurons in each layer.

For the sake of simplicity, we assume that both the input and the output
are one-dimensional.

The network given by (i) has 30 parameters: 10 weights $w_j$ (from the
input to the hidden layer); 10 biasses $b_j$ (one for each hidden neuron); 10
weights $\alpha_j$ (from the hidden layer to the output). See Fig. 13.6.

The network given by (ii) has 45 parameters: 5 weights $w_j$ (from the
input to the first hidden layer); $5^2 = 25$ weights (from the first to the second
hidden layer); 5 weights $\alpha_j$ (from the second hidden layer to the output);
10 biasses $b_j$ (one for each neuron in each hidden layer). See Fig. 13.7 a.






Figure 13.6: A one-hidden layer neural network with $N = 10$ hidden neurons,
which depends on 30 parameters.


The network given by (iii) has 30 parameters: 2 weights $w_j$ (from the
input to the first hidden layer); 16 intermediate weights (4 between each two
consecutive hidden layers); 2 weights $\alpha_j$ (from the last hidden layer to the
output); 10 biasses $b_j$. The network is pictured in Fig. 13.7 b.

We conclude that the maximum number of parameters is reached in
case (ii), which is neither the shallowest nor the deepest neural network
considered. We shall show that this is a typical behavior for general feedforward
neural network architectures.
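The three counts above can be reproduced with a few lines of code. The sketch below follows the counting convention used in this section: one weight for each pair of neurons in consecutive layers and one bias for each hidden neuron (the output neuron carries no bias in this count). The function name `count_parameters` and the list encoding of an architecture are illustrative assumptions.

```python
def count_parameters(widths):
    """Number of weights and hidden biasses of a feedforward network.

    widths = [d0, d1, ..., dL]; the first and last entries are the input and
    output sizes, the entries in between are the hidden layer sizes.
    """
    weights = sum(a * b for a, b in zip(widths, widths[1:]))
    biases = sum(widths[1:-1])          # one bias per hidden neuron, as in the text
    return weights + biases


architectures = {
    "(i)   1 hidden layer of 10": [1, 10, 1],
    "(ii)  2 hidden layers of 5": [1, 5, 5, 1],
    "(iii) 5 hidden layers of 2": [1, 2, 2, 2, 2, 2, 1],
}
for name, widths in architectures.items():
    print(name, "->", count_parameters(widths), "parameters")
```

Changing the lists makes it easy to compare other ways of distributing the same $N$ hidden neurons across layers.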

We consider now a neural network with $L - 1$ hidden layers (the layers $0$
and $L$ are reserved for the input and output layers, respectively). As usual, we
denote by $d^{(\ell)}$ the number of neurons in the $\ell$th layer. For simplicity,
we choose $d^{(0)} = d^{(L)} = 1$. Since the number of hidden neurons is equal to
$N$, we have

$$\sum_{\ell=1}^{L-1} d^{(\ell)} = N. \tag{13.3.7}$$





Figure 13.7: Two neural nets with $N = 10$ neurons: a. A 2-hidden layer neural
network depending on 45 parameters. b. A 5-hidden layer neural network
depending on 30 parameters.

The number of weights between the layers $\ell - 1$ and $\ell$ is $d^{(\ell-1)} d^{(\ell)}$. The
total number of biasses is equal to $N$, one for each hidden neuron. Then, the
total number of parameters, including weights and biasses, is given by

$$d^{(0)} d^{(1)} + d^{(1)} d^{(2)} + \cdots + d^{(\ell-1)} d^{(\ell)} + \cdots + d^{(L-1)} d^{(L)} + N. \tag{13.3.8}$$

The problem of finding the feedforward neural network of maximum
capacity can be now formulated equivalently as:

What is the number of layers $L$, and the number of neurons, $d^{(\ell)}$, in each
hidden layer $\ell$, with $1 \le \ell \le L - 1$, such that the expression (13.3.8) reaches
its maximum, given the constraint (13.3.7)?

This problem has the following geometric significance. Each product term
$d^{(\ell-1)} d^{(\ell)}$ is interpreted as the area of a rectangle. Then, starting from a
rectangle with dimensions $d^{(0)} \times d^{(1)}$, we continue the construction with a rectangle
with dimensions $d^{(1)} \times d^{(2)}$ as in Fig. 13.8. The even sides are displayed
vertically, while the odd ones are horizontal. The entire figure can be inscribed
into a rectangle $\mathcal{R}$ having the width equal to $d^{(1)} + d^{(3)} + \cdots$ and the height
given by $d^{(0)} + d^{(2)} + \cdots$. The constraint (13.3.7) is geometrically equivalent
with the fact that the sum of the dimensions (i.e., width and height) of the
rectangle $\mathcal{R}$ is constant, equal to $N$. The goal is to maximize the sum of the
rectangles' areas, given the aforementioned constraint.

One approximate approach is to maximize the area of the rectangle $\mathcal{R}$
first. This occurs when the width is equal to the height, i.e., when it becomes
a square. Then we try to fill in the entire square $\mathcal{R}$ by rectangles using
the previous construction such that we fill in most of it. A combinatorial
argument shows that the optimal construction occurs only when we have two
hidden layers. In this case the situation looks like in Fig. 13.9 a. If we have
more layers, let's say 3 layers, then the construction is not optimal because
there is more unfilled space left, see Fig. 13.9 b. The reader should be able
to fill in the missing details of the argument.

Figure 13.8: The area significance of the sum $d^{(0)} d^{(1)} + d^{(1)} d^{(2)} + \cdots + d^{(7)} d^{(8)}$.

The problem has an exact mathematical solution if we assume from the
beginning that

$$d^{(1)} = d^{(2)} = \cdots = d^{(L-1)},$$

namely, when each hidden layer has the same number of neurons. For
simplicity, let $k = L - 1$ denote the number of hidden layers in the network, so
each hidden layer has $N/k$ neurons. Then the number of parameters given by
(13.3.8) becomes

$$f_N(k) = \frac{N}{k} + (k - 1) \Big( \frac{N}{k} \Big)^2 + \frac{N}{k} + N.$$

This will be optimized with respect to $k$ as follows. We start by
rewriting the expression in terms of $1/k$ as

$$f_N(k) = \frac{2N}{k} + \Big( 1 - \frac{1}{k} \Big) \frac{N^2}{k} + N = \frac{N}{k} \Big( 2 + N - N \frac{1}{k} \Big) + N.$$






















Figure 13.9: a. The blank rectangle is a square of area equal to 1. b. The
blank rectangles have an area larger than 1.


Substituting $u = 1/k$, we obtain a quadratic function in $u$ 

$$\phi_N(u) = Nu\,(2 + N - Nu) + N = -N^2u^2 + N(N+2)u + N.$$ 

The maximum of $\phi_N(u)$ is reached for 

$$u^* = \frac{N(N+2)}{2N^2} = \frac{N+2}{2N}.$$ 

This corresponds to a number of hidden layers given by 

$$k = \frac{2N}{N+2}.$$ 


Even if this number is not always an integer, for a large number of neurons 
$N$, the optimal number of hidden layers is well approximated by $k = 2$. This 
explains why in the case when $N = 10$, having a network with two hidden 
layers, each having 5 neurons, achieves the maximum capacity. 
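
The count $f_N(k)$ can be checked numerically. The following is a minimal sketch, assuming $d^{(0)} = d^{(L)} = 1$ and hidden layers of equal width $N/k$; the helper names `num_parameters` and `best_depth` are hypothetical and only depths $k$ that divide $N$ are compared.

```python
from fractions import Fraction

def num_parameters(N, k):
    """f_N(k) = N/k + (k-1)(N/k)^2 + N/k + N for k equal-width hidden layers."""
    width = Fraction(N, k)                  # neurons per hidden layer, N/k
    return 2 * width + (k - 1) * width**2 + N

def best_depth(N):
    """Depth k, among the divisors of N, that maximizes f_N(k)."""
    divisors = [k for k in range(1, N + 1) if N % k == 0]
    return max(divisors, key=lambda k: num_parameters(N, k))

N = 10
counts = {k: int(num_parameters(N, k)) for k in (1, 2, 5, 10)}
k_star = best_depth(N)
continuous_optimum = Fraction(2 * N, N + 2)  # k = 2N/(N+2), not an integer here

print(counts)              # parameter count for each admissible depth
print(k_star, continuous_optimum)
```

For $N = 10$ the continuous optimum $2N/(N+2)$ is not an integer, and the best admissible depth in this sketch is the neighboring integer $k = 2$.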

The theoretical maximum number of parameters is therefore attained for $k = 2$ 
hidden layers. In fact, this is equal to the value 

$$f_N(2) = \frac{N^2}{4} + 2N. \qquad (13.3.9)$$ 


Therefore, the maximum dimension of the output manifold $S$ grows quadratically 
in the number of hidden neurons $N$. We shall show next that this may 
sometimes lead to overfitting the data. 

The previous formulas have been developed in the particular case of networks 
satisfying $d^{(0)} = d^{(L)} = 1$. For the general case, see Exercise 13.8.2. 
























Figure 13.10: The orthogonal projection of $z$ on the manifold $S$ is $y(\theta^*)$. 


13.4 Optimal Parameter Values 

Consider the training set $\{(x_1, z_1), (x_2, z_2), \dots, (x_n, z_n)\}$, and let $y_i$ be the 
network's one-dimensional output corresponding to the one-dimensional input 
$x_i$, with $1 \le i \le n$. Then $y_i = y_i(\theta)$, where $\theta \in \mathbb{R}^r$ is the parameter vector 
of the network. Therefore, the vector $y^T = (y_1, \dots, y_n) \in \mathbb{R}^n$ is parametrized 
by $\theta$, and hence describes a manifold $S$ in $\mathbb{R}^n$ of dimension $r$. In fact, the 
output manifold $S$ is an $r$-dimensional surface in $\mathbb{R}^n$. Training the network 
is equivalent to finding an exact or approximate value of $\theta^*$ for which the 
distance from $z^T = (z_1, \dots, z_n)$ to $S$ is minimum, namely 

$$\theta^* = \arg\min_{\theta} \operatorname{dist}(z, S) = \arg\min_{\theta} \|z - y(\theta)\|,$$ 


where the distance is measured in the Euclidean sense. This is equivalent to 
the fact that $y(\theta^*)$ is the orthogonal projection of $z$ on $S$, see Fig. 13.10. 

If the network has two hidden layers, then the previous analysis provides 
$r = \frac{N^2}{4} + 2N$, where $N$ is the number of hidden neurons. If $N$ is such that 
$\frac{N^2}{4} + 2N \ge n$, then the network exhibits an overfit, since it memorizes the 
entire training set. We shall fill in some mathematical details of this fact. 

Assume we have just equality, $\frac{N^2}{4} + 2N = n$. Then the submanifold $S$ has 
the same dimension as the target space $\mathbb{R}^n$, and hence it is either the entire 
space, or a subset of it. Consequently, we can choose parameters $\theta$ such that 
the point $z$ belongs to the manifold $S$, a fact that implies that the aforementioned 
distance is equal to zero. The details of this implication are given next. The 











system of $n$ equations 

$$\begin{aligned} y_1(\theta) &= z_1 \\ &\;\;\vdots \\ y_n(\theta) &= z_n \end{aligned}$$ 

has $n$ unknowns, $\theta = (\theta_1, \dots, \theta_n)$. We can invert the system assuming the 
regularity hypothesis 

$$\det\left(\frac{\partial y_i}{\partial \theta_j}\right) \neq 0,$$ 

which geometrically corresponds to stating that the output manifold $S$ has 
a tangent plane at each point (there are no corner points or thorns on the 
manifold). Then the system has a unique solution, $\theta^* = \theta^*(y, z)$. This shows 
that $z$ is a point on the manifold $S$, having the corresponding parameter $\theta^*$. 

We note that the inequality $\frac{N^2}{4} + 2N > n$ cannot hold because this would 
mean that the submanifold has a dimension larger than the space itself. 
Equivalently, this means that the aforementioned system has a number of 
unknowns $\theta$ larger than the number of equations, $n$. This would imply that 
the weights and biases cannot be independent, which is a contradictory 
statement for a neural network. 

Example 13.4.1 We shall provide an example applied to the MNIST data 
set, which comes with $n = 55{,}000$ training data, $\{(x_i, z_i)\}$. There are some 
differences from the previous theory, since each input, $x_i$, is 784-dimensional 
(each 28 × 28 image is flattened out into a 784-dimensional vector), while each 
output, $z_i$, is 10-dimensional (there are 10 digit classes). These modifications 
shall be taken into account when calculating the dimension of the output 
manifold. In the following we shall take $N = 500$ neurons and consider the 
following feedforward neural network architectures: 

(i) 784-500-10: 1-hidden layer network with N = 500 neurons; 

(ii) 784-250-250-10: 2-hidden layer network with 250 neurons in each layer; 

(iii) 784-50-50-···-50-10: 10-hidden layer network with 50 neurons in each 
layer. 

In this example we have $d^{(L)} = 10$, because each target vector, $z_i$, is 
10-dimensional. Since there are 55,000 targets, this yields a space with dimension 
55,000 × 10 = 550,000. Therefore, the output manifold $S$ is a submanifold of 
the numerical space $\mathbb{R}^{550,000}$. We shall discuss next the dimensions of the 
output manifolds for the aforementioned architectures. 

In case (i) the dimension of the output manifold $S$ is 

$$r = 784 \times 500 + 500 \times 10 + 500 = 397{,}500$$ 





since there are 784 × 500 weights from the input to the hidden layer, 500 biases, 
and 500 × 10 weights from the hidden layer to the output. The accuracy in 
this case is about 97.2 percent and the test error is about 1.49. The small 
test error indicates the optimality of the network (neither overfitting nor 
underfitting). 

In case (ii) the dimension of the output manifold is 

$$r = 784 \times 250 + 250 \times 250 + 250 \times 10 + 500 = 261{,}500,$$ 

which is about half of the dimension of the target space. The accuracy of the 
network in this case is 96.5 percent, which is lower than that of the previous 
network. The test error, which is 565, is also larger than in the previous case. 

In case (iii) the dimension of the manifold is 

$$r = 784 \times 50 + 9 \times 50 \times 50 + 500 + 500 = 62{,}700,$$ 

which is about 9 times smaller than the dimension of the target space. 
This leads to an underfit of the data, a fact suggested by the test error, which is 
1,027. The accuracy in this case is only 94.5 percent. We notice a decrease in 
accuracy as the network gets deeper. 

These computations have been executed over 4,000 iterations with a 
batch size of 40. We employed the Adam optimization algorithm, which 
decreases the learning step as we get closer to the optimum point. The cost 
function used was the sum of squared errors. 
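
The three dimensions above can be recomputed with a short script. This is only a sketch of the counting convention used in the example (all weights between consecutive layers plus one bias per hidden neuron, with no bias counted for the output layer); the function name `output_manifold_dimension` is hypothetical.

```python
def output_manifold_dimension(layers):
    """Weights between consecutive layers plus one bias per hidden neuron.

    Following the counts in the example, output-layer biases are not included.
    """
    weights = sum(a * b for a, b in zip(layers[:-1], layers[1:]))
    hidden_biases = sum(layers[1:-1])
    return weights + hidden_biases

architectures = {
    "(i)": [784, 500, 10],
    "(ii)": [784, 250, 250, 10],
    "(iii)": [784] + [50] * 10 + [10],
}
target_dim = 55_000 * 10  # dimension of the target space

for name, layers in architectures.items():
    r = output_manifold_dimension(layers)
    # ratio of the target-space dimension to the manifold dimension
    print(name, r, round(target_dim / r, 2))
```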

Learning with information There is another way to characterize the optimal 
parameters, using the concept of information fields. Consider a two-layer 
neural network with input $X$ and output $Y$, with an increasing activation 
function $\phi$. The output is related to the input by 

$$Y = \phi(W^T X + B).$$ 

Since $\phi$ is invertible, $W^T X + B = \phi^{-1}(Y)$, and by Proposition D.5.1 
we have $\mathfrak{S}(Y) = \mathfrak{S}(W^T X + B)$. Since adding constants does not change the 
information field, see Exercise 11.10.1, we have $\mathfrak{S}(Y) = \mathfrak{S}(W^T X)$. Hence, 
the output information, $\mathcal{E} = \mathfrak{S}(Y)$, is independent of the bias $B$ but depends 
on the system of weights $W$. Since $(W, B)$ are coordinates on the associated 
output manifold, we can associate with each point on this manifold the 
parameter-dependent sigma-field $\mathcal{E} = \mathcal{E}_{W,B}$. The information is preserved in 
the direction of coordinate $B$, i.e., it is conserved along slices $\{W = \text{constant}\}$. 

In the case of a three-layer neural network, with input $X$ and output $Y$, 
the output is related to the input by 

$$Y = \phi\big(W^{(2)T}\phi(W^T X + B) + B^{(2)}\big).$$ 




Similarly to the previous computation, we can show that the output 
information is 

$$\mathcal{E} = \mathfrak{S}(Y) = \mathfrak{S}\big(W^{(2)T}\phi(W^T X + B)\big).$$ 

This depends on the weights $W$, $W^{(2)}$, and the bias in the first layer, $B$, but 
is independent of the bias in the last layer, $B^{(2)}$. 

In general, the output manifold of any feedforward neural network can 
be endowed with an information structure. Each point on the manifold is 
parametrized by some weights and biases to which we associate an information 
field. 

If $\mathcal{I}$, $\mathcal{E}_{W,B}$ and $\mathcal{Z}$ denote the input, output and target information fields, 
respectively, we assume the double inclusion holds 

$$\mathcal{E}_{W,B} \subset \mathcal{Z} \subset \mathcal{I}.$$ 

Given the information $\mathcal{E}_{W,B}$, the target variable, $Z$, is best approximated 
by the conditional expectation $\mathbb{E}[Z \mid \mathcal{E}_{W,B}]$. The optimal weights are given by 

$$(W^*, B^*) = \arg\min_{W,B} \big\| Z - \mathbb{E}[Z \mid \mathcal{E}_{W,B}] \big\|_{L^2}.$$ 

The network output corresponding to the weights $(W^*, B^*)$ is the best 
approximator of the target $Z$ in the information sense. 

We shall point out next the exact learning case. Assume now that, for some 
parameters $(W, B)$, we have 

$$\mathcal{Z} \subset \mathcal{E}_{W,B} \subset \mathcal{I}.$$ 

Then any target variable, $Z$, i.e., a random variable that is $\mathcal{Z}$-measurable, is 
also $\mathcal{E}_{W,B}$-measurable, in which case $Z = \mathbb{E}[Z \mid \mathcal{E}_{W,B}]$. This corresponds to 
exact learning, since $Z$ is completely determined by the output information 
of the network. 

13.5 The Metric Structure 

We have seen that the number of hidden neurons, $N$, determines the dimension 
of the output manifold $S$. This has the maximum dimension in the case 
when there are two hidden layers. However, if $r$ is smaller than the maximum 
value, there might be several manifolds of the same dimension $r$, which are 
associated with different feedforward network architectures. 

For instance, the encoder and decoder networks represented in Fig. 13.11 a, 
b depend on the same number of parameters and have the same number of 
hidden neurons. However, their task is very different. This is explained by the 
fact that the architecture of the network (which here means the sequence of 










Figure 13.11: Two symmetric networks, which depend on the same number 
of parameters: a. An encoder. b. A decoder. The number of parameters is 59; 
there are 14 biases (one for each neuron) and 45 weights. 


numbers $d^{(\ell)}$) produces different metric effects on the manifold $S$. The 
geometric shape of $S$ depends on the sensitivity of the network output $y$ with 
respect to the weights and biases. Thus, the output tends to be less sensitive 
to weights situated in the first layers, closer to the input, and more sensitive 
to weights situated in the last layers, closer to the output. 

We may say that a neural network is associated with the approximation 
manifold $S$, and to train the network means to find the minimum distance 
from the target point $z$ to the manifold $S$. Hence, the geometric properties of 
the manifold $S$, such as shape, metric structure, etc., play an important 
role in the study of neural networks. 

The metric structure We need to endow S with a Riemannian metric, 
which will be used to measure distances between points on S and angles 
between tangent vectors to S. 

Since $S$ is a submanifold of $\mathbb{R}^n$ of dimension $r$, it is natural to endow $S$ 
with the natural metric induced from the Euclidean structure of $\mathbb{R}^n$. If $\theta_i$ 
represent the parameters of the network (either weights or biases), then the 
basic tangent vector fields to $S$ are given by partial derivatives with respect 
to coordinates $\theta_i$ as 

$$\xi_i = \frac{\partial y}{\partial \theta_i} = \left(\frac{\partial y_1}{\partial \theta_i}, \dots, \frac{\partial y_n}{\partial \theta_i}\right), \qquad 1 \le i \le r.$$ 

The tangent space to $S$ at $y$ is the linear space, $T_yS = \operatorname{span}\{\xi_i;\ 1 \le i \le r\}$, 
generated by all basic tangent vectors at that point. If $S$ is a smooth manifold, 
then $T_yS$ has a constant dimension $r$ at each point $y \in S$. This condition 
can be stated equivalently as the maximal rank condition 

$$\operatorname{rank}\left(\frac{\partial y}{\partial \theta}\right) = r,$$ 





which states that the basic tangent vector fields are linearly independent. 
This regularity condition assures that the output manifold $S$ is smooth (no 
corners, cusps, etc.). A tangent vector to $S$ at $y$, $v \in T_yS$, is defined by the 
linear combination 

$$v = \sum_{i=1}^{r} v_i\, \xi_i,$$ 

where $v_i = v_i(\theta)$ are the components of $v$. 


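The natural metric structure of $S$ is provided by the first fundamental 
form with coefficients given by 

$$g_{ij}(y) = \langle \xi_i, \xi_j \rangle = \left(\frac{\partial y}{\partial \theta_i}\right)^T \frac{\partial y}{\partial \theta_j} = \sum_{k=1}^{n} \frac{\partial y_k}{\partial \theta_i}\frac{\partial y_k}{\partial \theta_j}. \qquad (13.5.10)$$ 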


The $r \times r$ matrix $g = (g_{ij})$ can be used to compute lengths of tangent 
vectors, angles between directions, lengths of curves on $S$, distances between 
points, areas of regions on $S$, and in general, any mathematical concept that 
depends on the intrinsic structure of the manifold $S$. 

We recall that the concepts of intrinsic and extrinsic are often used in 
differential geometry to refer to the geometric information arising from the 
local and global structures of $S$, respectively. For instance, measuring the 
angle between two curves on $S$ can be done using the local information, 
namely, it can be performed by a microscopic inhabitant of the manifold, 
who is not allowed to leave the manifold. On the other hand, the training 
error of the network, which is the distance from an exterior target point $z$ 
to $S$, is an extrinsic concept, since it depends on the ability of a manifold 
inhabitant to fly above the manifold, in the exterior space, which allows him 
to make measurements. 

Length of vectors Consider a vector, $v = \sum_{i=1}^{r} v_i\,\xi_i$, tangent to the 
manifold $S$. We shall measure its length in two different ways: extrinsically, as a 
vector in $\mathbb{R}^n$, and intrinsically, as a tangent vector to $S$. 

If $\{e_k;\ 1 \le k \le n\}$ denotes the natural orthonormal basis in $\mathbb{R}^n$, then the 
$k$th component of $v$ in $\mathbb{R}^n$ is given by 

$$\langle v, e_k \rangle = v^T e_k = \sum_{i=1}^{r} v_i \frac{\partial y_k}{\partial \theta_i}, \qquad 1 \le k \le n,$$ 

where $y_k = \langle y, e_k \rangle$. Then the square of the Euclidean length of $v$ is given by 

$$\|v\|_{Eu}^2 = \sum_{k=1}^{n} \langle v, e_k \rangle^2.$$ 





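The same length can be computed intrinsically using a scalar product 
with coefficients $g_{ij}$ as follows: 

$$\|v\|_g^2 = \sum_{i,j} v_i v_j\, g_{ij} = \sum_{i,j} v_i v_j \sum_{k=1}^{n} \frac{\partial y_k}{\partial \theta_i}\frac{\partial y_k}{\partial \theta_j} = \sum_{k=1}^{n} \Big(\sum_{i} v_i \frac{\partial y_k}{\partial \theta_i}\Big)\Big(\sum_{j} v_j \frac{\partial y_k}{\partial \theta_j}\Big) = \sum_{k=1}^{n} \langle v, e_k \rangle^2 = \|v\|_{Eu}^2.$$ 

The fact that we obtained equal lengths in both cases was expected, since the 
metric $g_{ij}$ of $S$ is induced from the space $\mathbb{R}^n$ and the length is independent 
of the intrinsic or extrinsic approach. 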
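
The equality of the intrinsic and extrinsic lengths can be illustrated numerically. The sketch below is not tied to any particular network: it assumes a hypothetical toy output map $y_k(\theta) = \tanh(\theta_1 x_k + \theta_2)$ with $n = 5$ and $r = 2$, builds the Jacobian whose columns are the basic tangent vectors, and forms the first fundamental form as $g = J^T J$.

```python
import numpy as np

def jacobian(theta, x):
    """Columns are the basic tangent vectors xi_i = dy/dtheta_i of the toy map
    y_k(theta) = tanh(theta_1 * x_k + theta_2) (an illustrative assumption)."""
    y = np.tanh(theta[0] * x + theta[1])
    dy = 1.0 - y**2                        # derivative of tanh
    return np.column_stack([dy * x, dy])   # shape (n, r) with r = 2

x = np.linspace(-1.0, 1.0, 5)    # hypothetical inputs, n = 5
theta = np.array([0.7, -0.2])    # hypothetical parameter values
J = jacobian(theta, x)
g = J.T @ J                      # g_ij = <xi_i, xi_j>, formula (13.5.10)

coeffs = np.array([1.5, -0.5])   # components v_i of a tangent vector
v = J @ coeffs                   # the same vector viewed in R^n

extrinsic_sq = float(v @ v)                 # squared Euclidean length in R^n
intrinsic_sq = float(coeffs @ g @ coeffs)   # sum_ij v_i v_j g_ij

print(extrinsic_sq, intrinsic_sq)
```

Both numbers agree up to rounding, since $v^T v = c^T J^T J c$ for $v = Jc$.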

Length of curves Assume the weights and biases of a neural network depend 
on an extra parameter $s$. This can be either time, or a certain hyperparameter 
of the network. Then $\theta_i = \theta_i(s)$, $1 \le i \le r$, and hence $c(s) = y(\theta(s))$ represents 
a curve on the manifold $S$. Therefore, the continuous tuning of a network 
hyperparameter corresponds to a curve on the manifold. If $s$ takes values 
between $a$ and $b$, then the length of the curve $c(s)$ is defined intrinsically by 
the integral 

$$\int_a^b \|\dot c(s)\|\, ds = \int_a^b \sqrt{\sum_{i,j} \dot c_i(s)\, \dot c_j(s)\, g_{ij}(c(s))}\; ds,$$ 

where $\dot c(s)$ represents the tangent vector along the curve, with components 
$\dot c_i(s) = \dot\theta_i(s)$ with respect to the basic tangent vectors $\xi_i$. The chain rule provides 

$$\dot c(s) = \frac{d}{ds} c(s) = \frac{d}{ds}\, y(\theta(s)) = \sum_{i} \frac{\partial y}{\partial \theta_i}\frac{d\theta_i}{ds} = \langle \nabla_\theta y, \dot\theta(s) \rangle. \qquad (13.5.11)$$ 
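
As an illustration of the length integral, the following sketch integrates the speed $\sqrt{\dot\theta^T g\, \dot\theta}$ along a parameter path and compares it with the length of the sampled curve in $\mathbb{R}^n$. The toy map $y_k(\theta) = \tanh(\theta_1 x_k + \theta_2)$ and the path $\theta(s) = (s, s^2/2)$ are assumptions chosen only for the example.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 5)             # hypothetical inputs

def y(theta):
    """Toy output map (an assumption for this sketch)."""
    return np.tanh(theta[0] * x + theta[1])

def metric(theta):
    """First fundamental form g = J^T J at the point y(theta)."""
    dy = 1.0 - y(theta) ** 2
    J = np.column_stack([dy * x, dy])
    return J.T @ J

def theta_of_s(s):
    return np.array([s, 0.5 * s**2])      # hypothetical tuning path

def theta_dot(s):
    return np.array([1.0, s])             # derivative of the path above

s_grid = np.linspace(0.0, 2.0, 2001)
speed = np.array([
    np.sqrt(theta_dot(s) @ metric(theta_of_s(s)) @ theta_dot(s)) for s in s_grid
])
ds = s_grid[1] - s_grid[0]
# trapezoidal rule for the intrinsic length integral
intrinsic_length = float(np.sum(0.5 * (speed[1:] + speed[:-1])) * ds)

# extrinsic check: length of the polygonal line through the sampled points
points = np.array([y(theta_of_s(s)) for s in s_grid])
polyline_length = float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))
chord = float(np.linalg.norm(points[-1] - points[0]))

print(intrinsic_length, polyline_length, chord)
```

The two length estimates agree closely, and both are at least as large as the straight-line distance between the endpoints measured in the ambient space.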


Geodesics Sometimes, we are interested in finding the curve of shortest 
length between two given points on $S$. If we look for the shortest curve 
between two points in $\mathbb{R}^n$, this is obviously a line segment. However, in the 
case of the manifold $S$ the characterization is more complex, the curves of 
shortest distance being the geodesics. One application of geodesics is to find 
the shortest curve between a given initial point, $y(\theta^0)$, and the optimal point, 
$y(\theta^*)$, which is the orthogonal projection of the target point $z$ on $S$. This curve 
corresponds to the most efficient tuning of the network, since a parameter 
tuning corresponds to a curve on the manifold. 

It is worth noting that the distance between two points of S measured 
in the metric of S is at least as large as the distance measured between the 
same points using the metric of the target space. These distances are equal in 





Figure 13.12: The optimum parameter, $s^*$, corresponds to the closest point on 
the curve $c(s)$ to the target $z$. 


the case when $S$ is a geodesic submanifold, namely, its second fundamental 
form is zero, $L = 0$. 

Optimal parameter value Assume that while modifying parameters $\theta(s)$ 
in terms of $s$, we notice first an improvement in accuracy followed by a 
decrease in accuracy. The smallest error is reached for some optimal value 
$s^*$. Geometrically, this corresponds to the point on the curve $c(s)$ which is 
the closest to the target point $z$, see Fig. 13.12. This occurs when the vector 
$\overrightarrow{z\,c(s)}$ is perpendicular to the tangent vector $\dot c(s)$, a fact that can be written as 

$$(z - c(s))^T \dot c(s) = 0.$$ 

If the parameter $s$ is modified such that $\|c(s)\|^2$ is constant,² then 
differentiating and using the product rule yields $c(s)^T \dot c(s) = 0$. Hence, the 
previous equation becomes $z^T \dot c(s) = 0$. Using formula (13.5.11) implies that 
the optimal value $s^*$ satisfies the equation 

$$\big\langle \nabla_\theta (z^T y)(s^*),\ \dot\theta(s^*) \big\rangle = 0.$$ 


13.6 Regularization 

The most desired property of a neural network is to generalize well. This 
means that after optimizing the network using a training set, the network 


2 This is also called the arc length parameter, since it is proportional to the arc length 
measured along the curve c(s). 










should still perform with high accuracy on unseen test data. In 
order to achieve this goal, the network should be constructed such that it 
does not overfit the training data. 

This phenomenon is easier to explain if we consider the particular case 
of polynomial regression. Consider 7 points, $(x_i, z_i)$, $1 \le i \le 7$, in the plane 
and use three types of polynomial models to perform regression. The linear 
regression is not a good model, leading to an underfit of the data, characterized 
by a large training error, see Fig. 13.13 a. The quadratic model produces a 
relatively small training error and is a good fit. The 7th degree interpolation 
polynomial produces an overfit, which is characterized by a zero training error 
and a large testing error. The way we should select the appropriate regression 
model is to be parsimonious when choosing the degree of the polynomial. At 
the same time, the polynomial should have a large enough degree to capture 
the main trend of the data without overfitting it. 
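
The three regimes can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the 7 training points are hypothetical (a quadratic trend plus an alternating perturbation of size 0.3), the test points are the midpoints of the training inputs, and the interpolating model uses degree 6, the lowest degree that passes through 7 points.

```python
import numpy as np

# Hypothetical data: quadratic trend with an alternating perturbation.
x_train = np.linspace(-1.0, 1.0, 7)
z_train = x_train**2 + 0.3 * (-1.0) ** np.arange(7)
x_test = 0.5 * (x_train[:-1] + x_train[1:])   # midpoints, unseen during the fit
z_test = x_test**2                            # values of the underlying trend

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (training error, test error)."""
    coeffs = np.polyfit(x_train, z_train, degree)
    train_err = float(np.sum((np.polyval(coeffs, x_train) - z_train) ** 2))
    test_err = float(np.sum((np.polyval(coeffs, x_test) - z_test) ** 2))
    return train_err, test_err

errors = {d: fit_and_score(d) for d in (1, 2, 6)}
for d, (train_err, test_err) in errors.items():
    print(f"degree {d}: train {train_err:.3e}, test {test_err:.3e}")
```

On this data the line has the largest training error, the interpolating polynomial has an essentially zero training error but a much larger test error than the quadratic, which matches the underfit, good fit, and overfit pattern described above.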

Similar observations apply to the case of a general neural network. The 
approximation polynomial in this case is replaced by the output manifold 
$S$ and the degree of the polynomial corresponds to the dimension of $S$, i.e., 
the number of network parameters $\theta_i$. The principle of being parsimonious 
translates in this case into selecting a manifold of small dimension, namely, a 
network with a small number of neurons. 

The setup of the problem is as follows: Given $N$ neurons and a training 
set $\{(x_i, z_i);\ 1 \le i \le n\}$, construct a neural network that learns from the data 
without overfitting it. 

We shall present next a few regularization techniques, that is, ways to 
avoid or reduce data overfitting, for a given number of neurons, N. 

13.6.1 Going for a smaller dimension 

Since $N$ is given, we need to decide on the number of hidden layers and 
the number of neurons in each layer. We have seen that the use of only 
two hidden layers produces the maximum dimension for the manifold $S$, so 
we should avoid this. We should either go shallow, with only one hidden 
layer, or go deep with a large enough number of hidden layers such that the 
dimension of $S$ is sufficiently small, and the parsimony criterion holds. 

13.6.2 Norm regularization 

In order to reduce overfitting, smaller weights, $w$, should be used. The cost 
function, $C(w, b)$, is modified by adding an extra term involving a norm of 
the weights, multiplied by a positive Lagrange multiplier, $\lambda$, which describes 
the preference for small weights (larger $\lambda$ corresponds to smaller weights). 






Figure 13.13: Polynomial regression through 7 points: (a) Using a line leads 
to an underfit; (b) using a quadratic polynomial leads to a good fit; (c) the 
use of a 7th degree polynomial overfits the data. 


The regularized cost function becomes 

$$L(w) = C(w, b) + \lambda \|w\|^2,$$ 

where $\|\cdot\|$ is usually either the $L^1$ or the $L^2$ norm. This type of regularization 
has been discussed in more detail in section 3.11. The effect of the norm 
regularization is to look for an optimal point on the output manifold, which 
is localized in a certain neighborhood. 
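
The effect of the multiplier $\lambda$ on the size of the weights can be seen in the simplest setting. The sketch below assumes a linear model without bias, a sum of squares cost, and the $L^2$ norm, for which the regularized minimizer has a closed form; the data and the function name `ridge_weights` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                       # hypothetical inputs
w_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
z = X @ w_true + 0.1 * rng.normal(size=30)         # hypothetical noisy targets

def ridge_weights(X, z, lam):
    """Minimizer of ||X w - z||^2 + lam * ||w||_2^2 for a linear model."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ z)

lams = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_weights(X, z, lam))) for lam in lams]

for lam, nrm in zip(lams, norms):
    print(f"lambda = {lam:6.1f}   ||w|| = {nrm:.4f}")
```

The norm of the optimal weights decreases as $\lambda$ grows, in line with the remark that larger multipliers correspond to smaller weights.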


13.6.3 Choosing the flattest manifold 

We notice that there are several neural structures corresponding to a fixed 
dimension of the manifold $S$. This can be seen, for instance, in the encoder 
and decoder structures given in Fig. 13.11. The question is now, which 
structure is better from the regularization point of view? 

In order to decrease the testing error, we shall go for the neural network 
for which the manifold S is as flat as possible. By “flat” we refer to a manifold 
that bends as little as possible in the target space R n . 

Flatness is an extrinsic concept that can be formalized in geometric terms 
by means of the second fundamental form. We shall start with a few examples. 

Example 13.6.1 Consider a plane $\mathcal{P}$ in the space $\mathbb{R}^3$. This is flat since it 
does not bend. This is equivalent to observing that the normal vector to the 
plane is a constant vector field. The way a surface bends is described by the 
rate at which its normal vector changes its direction. This is called the shape 
operator (or the Weingarten map) of the surface and it is related to different 
types of curvature, see, for instance, [85]. 

Example 13.6.2 Another example involves a plane curve, $c(s)$, with unit 
tangent vector $T(s)$ and normal vector $N(s)$, where $s$ denotes the arc length 
parameter. The rate of change of the normal vector is given by $N'(s) = 
-\kappa(s)T(s)$, where $\kappa(s)$ is the curvature of $c(s)$, which describes the bending 




























of the curve. A zero curvature is equivalent to a zero rate of change of the 
normal, which corresponds to a straight curve, i.e., a line segment. 


Among all manifolds of the same dimension, in order to avoid overfitting, 
we need to choose the one which is as flat as possible. The second fundamental 
form, $L$, describes how the manifold $S$ is curved inside the space $\mathbb{R}^n$, as 
described in section 13.1.7. 

We make this more explicit in the particular case of the target space 
$\mathcal{M} = \mathbb{R}^n$. For any two vector fields $U = \sum_k U^k e_k$ and $V = \sum_k V^k e_k$ on $\mathbb{R}^n$, 
the derivation operator $\nabla$, which defines the derivative of $V$ with respect to 
$U$, is 

$$\nabla_U V = (D_U V^1, \dots, D_U V^n),$$ 

where $D_U f$ denotes the directional derivative of $f$ with respect to $U$. Now, 
if we consider $U$ and $V$ tangent vector fields to the manifold $S$, then $\nabla_U V$ can 
be decomposed orthogonally as 

$$\nabla_U V = (\nabla_U V)^{\parallel} + (\nabla_U V)^{\perp},$$ 

where $(\nabla_U V)^{\parallel}$ is the projection of $\nabla_U V$ on the tangent space of $S$, while 
$(\nabla_U V)^{\perp}$ denotes the orthogonal component of $\nabla_U V$. The normal component 

$$L(U, V) = (\nabla_U V)^{\perp}$$ 

denotes the second fundamental form of $S$ with respect to $\mathbb{R}^n$. The mapping 
$L$ is symmetric and bilinear and can be written as 

$$L(U, V) = \sum_{\alpha,\beta=1}^{r} L_{\alpha\beta}\, U^\alpha V^\beta,$$ 


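where the Greek superscripts describe the dependence with respect to the 
basic vector fields $\xi_\alpha = \frac{\partial y(\theta)}{\partial \theta_\alpha}$, where $U = \sum_{\alpha=1}^{r} U^\alpha \xi_\alpha$. The coefficients of the 
second fundamental form 

$$L_{\alpha\beta} = L(\xi_\alpha, \xi_\beta)$$ 

are vector-valued, belonging to the normal space to $S$, which has dimension 
$n - r$. If $L_{\alpha\beta} = \sum_k L^k_{\alpha\beta}\, e_k$, then each component $L^k_{\alpha\beta}$ forms a symmetric square 
matrix of order $r$. 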


Vanishing L form A vanishing second fundamental form, $L = 0$, is equivalent 
to the vanishing of its coefficients, $L_{\alpha\beta} = 0$. In this case, $S$ is called a geodesic 
submanifold of the Euclidean space $\mathbb{R}^n$. The equivalent characterization is 




that any locally length minimizing curve in $S$ is a straight line segment in 
$\mathbb{R}^n$. The manifolds with this property are the affine subspaces of $\mathbb{R}^n$, see 
Exercise 13.8.6. In this case the projection of the target $z$ onto $S$ is unique. 

The norm of the form L Since $L$ is vector-valued, for regularization 
purposes we shall define and use a norm of $L$. For any vector $U$ tangent to $S$ 
at $y$ we have that $L(U, U)$ is a vector in $\mathbb{R}^n$ and let $\|L(U, U)\|_{Eu}$ denote its 
Euclidean length. We define the norm of $L$ by 

$$\|L\| = \max\left\{ \frac{\|L(U, U)\|_{Eu}}{\|U\|^2};\ U \text{ tangent to } S \right\}. \qquad (13.6.12)$$ 

Here, $\|U\|$ denotes the length of $U$ measured either in $\mathbb{R}^n$ or using the metric 
on $S$. Using the scaling properties of $L$, see Proposition 13.1.8, part (m), this 
norm can be written equivalently as 

$$\|L\| = \max\left\{ \|L(U, U)\|_{Eu};\ U \text{ tangent to } S,\ \|U\| = 1 \right\}.$$ 

This norm is related to the eigenvalues of $L$ as follows. Due 
to symmetry, $L^k_{\alpha\beta} = L^k_{\beta\alpha}$, so that the matrix $L^k$ has real eigenvalues. The 
expression 

$$\max_{\|U\|=1} |L^k(U, U)| = |\lambda_k|$$ 

provides the absolute value of the largest eigenvalue of $L^k$, see Appendix G. 
Therefore, 

$$\|L\| \le (\lambda_1^2 + \cdots + \lambda_n^2)^{1/2}.$$ 

From the geometric point of view each $\lambda_k$ represents the curvature of the 
submanifold $S$ into a certain normal direction, and hence, $\|L\|$ represents an 
extrinsic measure of the curvature of $S$. By keeping $\|L\|$ small, all curvatures 
are kept small and by this we can control how much $S$ bends inside $\mathbb{R}^n$. 
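
The eigenvalue characterization of the maximum of a quadratic form over unit vectors can be checked numerically. The sketch below uses a hypothetical symmetric matrix in place of one component $L^k$ and assumes, for simplicity, that the basic tangent vectors are orthonormal, so that $\|U\| = 1$ is the ordinary Euclidean unit sphere in coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
Lk = 0.5 * (M + M.T)                        # hypothetical symmetric matrix L^k

eigvals, eigvecs = np.linalg.eigh(Lk)       # real eigenvalues of a symmetric matrix
idx = int(np.argmax(np.abs(eigvals)))
lam_max = float(np.abs(eigvals[idx]))       # largest eigenvalue in absolute value
u_star = eigvecs[:, idx]                    # unit eigenvector attaining the maximum

# Random unit vectors never exceed the bound given by the largest eigenvalue.
U = rng.normal(size=(10000, 4))
U /= np.linalg.norm(U, axis=1, keepdims=True)
sampled = np.abs(np.einsum("ni,ij,nj->n", U, Lk, U))

print(lam_max, float(sampled.max()), float(abs(u_star @ Lk @ u_star)))
```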

The new cost function is the regularization of the square of the distance using 
the previous norm 

$$C(w, b; \mu) = \|y(w, b) - z\|^2 + \mu \|L\|, \qquad (13.6.13)$$ 

where the hyperparameter $\mu$ is a Lagrange multiplier. Thus, the regularization 
process is obtained as a trade-off between two effects: the minimization 
of the training error (the distance from target $z$ to $S$) and the maximization 
of the flatness of the manifold, see Fig. 13.14. The hyperparameter $\mu$ captures 
this trade-off effect: larger values of $\mu$ correspond to flatter manifolds, while 
smaller values of $\mu$ correspond to manifolds that pass close to the target $z$. 





























Figure 13.14: Regularization with a manifold $S$ of the same dimension but 
different degrees of flatness: (a) Using a plane leads to a large distance from the target 
$z$, which underfits the data; (b) using a trade-off between curvature and distance to 
$z$ leads to a good fit; (c) using a largely curved manifold we can always force the 
target point $z$ to belong to the manifold, a case which corresponds to an overfit. 


Example 13.6.1 (Polynomial regression) This example presents the case 
of a polynomial regression in the context of the output manifold concept. The 
polynomial of degree $r$ 

$$\psi(x; \theta) = x^r + \theta_1 x^{r-1} + \theta_2 x^{r-2} + \cdots + \theta_{r-1} x + \theta_r \qquad (13.6.14)$$ 

is used to approximate in the least squares sense $n$ points with coordinates 
given by 

$$\mathcal{T} = \{(x_1, z_1), (x_2, z_2), \dots, (x_n, z_n)\}.$$ 

Using the training set $\mathcal{T}$, the polynomial coefficients $\theta_i$ shall be tuned such 
that the sum of squared errors is minimized. We shall assume that the values 
$x_i$ are distinct and that $r < n$. We shall consider the vectors $x^T = 
(x_1, x_2, \dots, x_n)$ and $z^T = (z_1, z_2, \dots, z_n)$ in $\mathbb{R}^n$. The input vector $x$ and the 
parameter vector $\theta = (\theta_1, \dots, \theta_r) \in \mathbb{R}^r$ are used to construct the manifold $S$ 
parametrized by $\theta \mapsto y(x; \theta) \in \mathbb{R}^n$ as 

$$y(x; \theta) = (\psi(x_1; \theta), \dots, \psi(x_n; \theta)).$$ 

Given the polynomial relation (13.6.14), the vector fields tangent to $S$ take 
the form $\xi_j = \sum_{k=1}^{n} x_k^{r-j} e_k$. More specifically, 

$$\begin{aligned} \xi_1 &= \frac{\partial y}{\partial \theta_1} = (x_1^{r-1}, \dots, x_n^{r-1}) \\ \xi_2 &= \frac{\partial y}{\partial \theta_2} = (x_1^{r-2}, \dots, x_n^{r-2}) \\ &\;\;\vdots \\ \xi_r &= \frac{\partial y}{\partial \theta_r} = (1, \dots, 1). \end{aligned}$$ 

















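Since $x_i$ have distinct values, the following Vandermonde determinant does 
not vanish 

$$\det \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_r \\ \vdots & \vdots & & \vdots \\ x_1^{r-1} & x_2^{r-1} & \cdots & x_r^{r-1} \end{pmatrix} = \prod_{i<j} (x_j - x_i) \neq 0,$$ 

and therefore, the maximal rank condition is satisfied, namely, 
$\operatorname{rank}\left(\frac{\partial y}{\partial \theta}\right) = r$. 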


Thus, the vector fields $\{\xi_1, \dots, \xi_r\}$ span the tangent space, $T_yS$, of $S$ at each 
point $y \in S$. The intrinsic geometry of $S$ is described by the metric tensor $g$ 
having the components 

$$g_{ij} = \langle \xi_i, \xi_j \rangle = \sum_{k=1}^{n} x_k^{r-i}\, x_k^{r-j} = \sum_{k=1}^{n} x_k^{2r-(i+j)}, \qquad 1 \le i, j \le r.$$ 


It is important to note that the matrix coefficients $g_{ij}$ (and hence the 
intrinsic geometry of $S$) do not depend on parameters $\theta_k$. (A similar case occurs 
for the Euclidean space $\mathbb{R}^n$ with the natural metric $\delta_{ij}$.) This corresponds to 
an intrinsically flat submanifold $S$. Since $\partial g_{ij}/\partial \theta_k = 0$, it follows that the 
Christoffel symbols vanish, $\Gamma^k_{ij} = 0$. This implies that the Riemannian 
curvature³ of the submanifold $S$ is zero. In particular, its Levi-Civita connection 
vanishes on the base 

$$\nabla^S_{\xi_i} \xi_j = \sum_k \Gamma^k_{ij}\, \xi_k = 0.$$ 

Consequently, the second fundamental form can be written only in terms of 
the Levi-Civita connection of $\mathbb{R}^n$ as 

$$L(\xi_i, \xi_j) = \nabla_{\xi_i} \xi_j = \sum_{k=1}^{n} \xi_i(\xi_j^k)\, e_k = 0,$$ 

because $\xi_i(\xi_j^k) = \langle \xi_i, \operatorname{grad} \xi_j^k \rangle = 0$, as the component $\xi_j^k = x_k^{r-j}$ is constant 
(depending on the fixed input entry $x_k$). Therefore, $L = 0$ and hence the 
submanifold $S$ is also extrinsically flat. The cost function 

$$C(\theta; \mu) = \|y(x; \theta) - z\|^2$$ 


³ The Riemannian curvature of a manifold is described by the tensor 

$$R^r_{ijk} = \partial_{\theta_i} \Gamma^r_{jk} - \partial_{\theta_j} \Gamma^r_{ik} + \Gamma^r_{ih}\Gamma^h_{jk} - \Gamma^r_{jh}\Gamma^h_{ik},$$ 

with summation over the repeated indices. 












in this case does not need the regularization term $\mu\|L\|$. 

In fact, the output manifold $S$ is an affine hyperplane in $\mathbb{R}^n$ of dimension 
$r$. The optimal solution is obtained by projecting the point $z$ onto this 
hyperplane. This projection can be computed explicitly using the Moore-Penrose 
pseudoinverse, see section G.2 in the Appendix. We shall write the condition 
constraints $\psi(x_j; \theta) = z_j$, $1 \le j \le n$, as an overdetermined linear system 

$$x_j^r + \theta_1 x_j^{r-1} + \cdots + \theta_{r-1} x_j + \theta_r = z_j, \qquad 1 \le j \le n.$$ 

This can be written in the matrix form as 

$$A\theta = \beta,$$ 

where $\theta^T = (\theta_1, \dots, \theta_r)$, $\beta^T = (z_1 - x_1^r, \dots, z_n - x_n^r)$ and 

$$A = \begin{pmatrix} x_1^{r-1} & \cdots & x_1 & 1 \\ \vdots & & \vdots & \vdots \\ x_n^{r-1} & \cdots & x_n & 1 \end{pmatrix}.$$ 

The optimal parameter, $\theta^*$, can be obtained by applying the pseudoinverse 

$$\theta^* = A^{+}\beta,$$ 

where $A^{+}$ is given by formula (G.2.6). 

The projection of $z$ onto the hyperplane $S$ is given by $y^* = y(x; \theta^*)$. 
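
The projection can be computed directly with a numerical pseudoinverse. The following sketch assumes the monic polynomial model (13.6.14) with $r = 2$ and a small set of hypothetical data points; `fit_monic_polynomial` is an illustrative helper name, and `numpy.linalg.pinv` plays the role of $A^{+}$.

```python
import numpy as np

def fit_monic_polynomial(x, z, r):
    """Least-squares theta_1..theta_r of x^r + theta_1 x^(r-1) + ... + theta_r."""
    # Columns of A are x^(r-1), ..., x, 1, i.e. the tangent vectors xi_1..xi_r.
    A = np.column_stack([x ** (r - j) for j in range(1, r + 1)])
    beta = z - x**r
    theta = np.linalg.pinv(A) @ beta        # theta* = A^+ beta
    return A, beta, theta

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])   # distinct inputs, n = 6
z = np.array([3.1, 0.9, 0.2, 1.1, 4.2, 8.8])     # hypothetical targets
r = 2

A, beta, theta = fit_monic_polynomial(x, z, r)
y_star = x**r + A @ theta     # projection of z onto the affine subspace S
residual = z - y_star         # should be normal to S

print(theta)
print(A.T @ residual)         # numerically zero: residual is orthogonal to each xi_j
```

The residual $z - y^*$ is orthogonal to every tangent vector $\xi_j$, which is the orthogonal projection property used in the text.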

We shall discuss next the output manifold associated with only one neuron. 
The computation is complicated even in this simple case, while in the 
case of a general neural network it cannot always be performed explicitly. 

Example 13.6.2 We shall consider the case of a single sigmoid neuron with 
the input $x \in \mathbb{R}$, output $y = \sigma(wx + b) \in \mathbb{R}$, and real parameters $w$, $b$, see 
Fig. 13.15. We take the activation function $\sigma$ to be the logistic function. If 
the training set is given by 

$$(x, z) = \{(x_1, z_1), (x_2, z_2), \dots, (x_n, z_n)\},$$ 

then the manifold $S$ associated with the previous neuron is defined by the 
map $\psi : \mathbb{R}^2 \to \mathbb{R}^n$ 

$$\psi(w, b) = y(w, b) = (y_1, \dots, y_n) = (\sigma(wx_1 + b), \dots, \sigma(wx_n + b)).$$ 

This represents a 2-dimensional surface included in the space $\mathbb{R}^n$, endowed 
with the Riemannian metric induced by the Euclidean structure of $\mathbb{R}^n$. The 
basic tangent vector fields to $S$ are given by the partial derivatives as 

$$\begin{aligned} \xi_1 &= \frac{\partial y(w, b)}{\partial w} = \sigma'(wx + b) \odot x = y \odot (1 - y) \odot x \\ \xi_2 &= \frac{\partial y(w, b)}{\partial b} = \sigma'(wx + b) = y \odot (1 - y), \end{aligned}$$ 







Figure 13.15: The manifold $S$ associated with a sigmoid neuron, $y = \sigma(wx + b)$, 
is 2-dimensional. 


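where we used that $\sigma' = \sigma(1 - \sigma)$ and $\odot$ denotes the Hadamard product. The 
linear independence of $\{\xi_1, \xi_2\}$ is assured by the condition $\operatorname{rank}(\xi_1, \xi_2) = 2$, 
which is obviously satisfied. The intrinsic geometry of the output manifold $S$ 
is described in terms of the following metric coefficients 

$$\begin{aligned} g_{11} &= \langle \xi_1, \xi_1 \rangle = \langle y \odot (1 - y) \odot x,\ y \odot (1 - y) \odot x \rangle \\ &= \langle y \odot y,\ x \odot x \rangle - 2\langle y \odot y \odot y,\ x \odot x \rangle + \langle y \odot y \odot x,\ y \odot y \odot x \rangle \\ g_{12} &= \langle \xi_1, \xi_2 \rangle = \langle y \odot (1 - y) \odot x,\ y \odot (1 - y) \rangle \\ &= \langle y \odot y,\ x \rangle - 2\langle y \odot y,\ y \odot x \rangle + \langle y \odot y,\ y \odot y \odot x \rangle \\ g_{22} &= \langle \xi_2, \xi_2 \rangle = \langle y \odot (1 - y),\ y \odot (1 - y) \rangle = \langle y - y \odot y,\ y - y \odot y \rangle \\ &= \langle y, y \rangle - 2\langle y \odot y,\ y \rangle + \langle y \odot y,\ y \odot y \rangle \\ g_{12} &= g_{21}. \end{aligned}$$ 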


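All the previous formulas can be expressed in terms of sums of powers, 
for instance, 

$$\langle y \odot y,\ x \odot x \rangle = \sum_k y_k^2 x_k^2, \qquad \langle y \odot y,\ y \rangle = \sum_k y_k^3, \qquad \langle y \odot y,\ y \odot y \rangle = \sum_k y_k^4,$$ 

where $y_k = \sigma(wx_k + b)$. 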
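
The expanded expressions for the metric coefficients can be verified against the direct scalar products of the tangent vectors. The sketch below uses hypothetical inputs and parameter values; in NumPy the Hadamard product is the elementwise `*` and the scalar product is `@`.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([-1.0, 0.0, 0.5, 2.0])    # hypothetical inputs, n = 4
w, b = 0.8, -0.3                        # hypothetical parameters
y = sigmoid(w * x + b)

# Basic tangent vectors and the metric computed directly as scalar products.
xi1 = y * (1 - y) * x
xi2 = y * (1 - y)
g_direct = np.array([[xi1 @ xi1, xi1 @ xi2],
                     [xi2 @ xi1, xi2 @ xi2]])

# The same coefficients from the expanded formulas in the text.
g11 = (y * y) @ (x * x) - 2 * (y * y * y) @ (x * x) + (y * y * x) @ (y * y * x)
g12 = (y * y) @ x - 2 * (y * y) @ (y * x) + (y * y) @ (y * y * x)
g22 = y @ y - 2 * (y * y) @ y + (y * y) @ (y * y)
g_expanded = np.array([[g11, g12], [g12, g22]])

print(g_direct)
print(g_expanded)
```

A positive determinant of the resulting matrix confirms numerically that the two tangent vectors are linearly independent for these inputs.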

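Differentiating the coefficients $g_{ij}$, one can potentially compute the 
Christoffel symbols of second type (13.1.2) and then obtain the Levi-Civita 
connection on $S$ using a basis of tangent vectors as $\nabla_{\xi_i} \xi_j = \sum_k \Gamma^k_{ij}\, \xi_k$. 

Since the computation is laborious, we shall proceed differently. 

With notation $(\theta_1, \theta_2) = (w, b)$, the Levi-Civita connection on the target 
space $\mathbb{R}^n$ becomes $\nabla_{\xi_i} \xi_j = \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j}$. Then the Gauss decomposition (13.1.5) 
writes as 

$$\frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j} = \left(\frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j}\right)^{\parallel} + \left(\frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j}\right)^{\perp}.$$ 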










The second fundamental form is given by the normal part 


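$$L_{ij} = \left( \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j} \right)^{\perp} = \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j} - \left( \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j} \right)^{\parallel}. \tag{13.6.15}$$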


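Both terms on the right side are computable. We shall start with $\frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j}$. Using the relation $\sigma''(x) = \sigma(x)(1 - \sigma(x))(1 - 2\sigma(x))$, which follows by differentiating $\sigma' = \sigma(1 - \sigma)$, we obtain

$$\begin{aligned}
\frac{\partial^2 \psi}{\partial \theta_1 \partial \theta_1} &= \frac{\partial^2 \psi}{\partial w^2} = \sigma''(wx + b) \odot x \odot x = y \odot (1 - y) \odot (1 - 2y) \odot x \odot x \\
\frac{\partial^2 \psi}{\partial \theta_2 \partial \theta_2} &= \frac{\partial^2 \psi}{\partial b^2} = \sigma''(wx + b) = y \odot (1 - y) \odot (1 - 2y) \\
\frac{\partial^2 \psi}{\partial \theta_1 \partial \theta_2} &= \frac{\partial^2 \psi}{\partial w \partial b} = \sigma''(wx + b) \odot x = y \odot (1 - y) \odot (1 - 2y) \odot x \\
\frac{\partial^2 \psi}{\partial \theta_2 \partial \theta_1} &= \frac{\partial^2 \psi}{\partial \theta_1 \partial \theta_2}.
\end{aligned}$$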


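The tangential component is a linear combination of basic vector fields

$$\left( \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j} \right)^{\parallel} = \alpha_{ij}^1 \xi_1 + \alpha_{ij}^2 \xi_2,$$

where the coefficients $\alpha_{ij}^k$ can be found explicitly as the solution of the linear system

$$\begin{aligned}
g_{11} \alpha_{ij}^1 + g_{12} \alpha_{ij}^2 &= \left\langle \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j},\ \xi_1 \right\rangle \\
g_{12} \alpha_{ij}^1 + g_{22} \alpha_{ij}^2 &= \left\langle \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j},\ \xi_2 \right\rangle.
\end{aligned}$$

This is

$$\begin{aligned}
\alpha_{ij}^1 &= g^{11} \left\langle \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j},\ \xi_1 \right\rangle + g^{12} \left\langle \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j},\ \xi_2 \right\rangle \\
\alpha_{ij}^2 &= g^{21} \left\langle \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j},\ \xi_1 \right\rangle + g^{22} \left\langle \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j},\ \xi_2 \right\rangle,
\end{aligned}$$

where $(g^{ij})$ is the inverse matrix $(g_{ij})^{-1}$. The scalar product terms can be computed easily. For instance: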


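$$\left\langle \frac{\partial^2 \psi}{\partial \theta_1 \partial \theta_1},\ \xi_1 \right\rangle = \langle y \odot (1 - y) \odot (1 - 2y) \odot x \odot x,\ y \odot (1 - y) \odot x \rangle = \sum_{k=1}^n y_k^2 (1 - y_k)^2 (1 - 2 y_k) x_k^3.$$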


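The coefficients $L_{ij}$ can now be computed using (13.6.15). For instance,

$$\begin{aligned}
L_{11} &= \frac{\partial^2 \psi}{\partial w^2} - \left( \frac{\partial^2 \psi}{\partial w^2} \right)^{\parallel} \\
&= y \odot (1 - y) \odot (1 - 2y) \odot x \odot x - \alpha_{11}^1 \xi_1 - \alpha_{11}^2 \xi_2 \\
&= y \odot (1 - y) \odot [(1 - 2y) \odot x \odot x - \alpha_{11}^1 x - \alpha_{11}^2].
\end{aligned}$$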
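The computation of $L_{11}$ can be sketched numerically as follows. This is only an illustration under stated assumptions: the second derivative of the sigmoid is taken as $\sigma'' = \sigma(1 - \sigma)(1 - 2\sigma)$, the inputs are a plain Python list, and the function name `second_fundamental_form_11` is hypothetical. The coefficients $\alpha_{11}^1, \alpha_{11}^2$ are obtained by solving the 2x2 linear system with Cramer's rule.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def second_fundamental_form_11(w, b, xs):
    """Normal part L_11 of d^2 psi / dw^2, returned with xi_1 and xi_2."""
    ys = [sigmoid(w * x + b) for x in xs]
    xi1 = [y * (1 - y) * x for y, x in zip(ys, xs)]   # d psi / dw
    xi2 = [y * (1 - y) for y in ys]                   # d psi / db
    # d^2 psi / dw^2 = sigma''(w x + b) * x * x, componentwise
    d2 = [y * (1 - y) * (1 - 2 * y) * x * x for y, x in zip(ys, xs)]
    g11, g12, g22 = dot(xi1, xi1), dot(xi1, xi2), dot(xi2, xi2)
    r1, r2 = dot(d2, xi1), dot(d2, xi2)
    det = g11 * g22 - g12 * g12          # nonzero if the x_k are not all equal
    a1 = (g22 * r1 - g12 * r2) / det     # alpha^1_11
    a2 = (g11 * r2 - g12 * r1) / det     # alpha^2_11
    L11 = [d - a1 * u - a2 * v for d, u, v in zip(d2, xi1, xi2)]
    return L11, xi1, xi2
```

A simple sanity check is that the returned vector is orthogonal to both tangent vectors, as required for a normal component.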


Consider the norm

$$\|L\| = \sup_{\|U\| = 1} \|L(U, U)\|,$$

which measures the flatness of S. The associated variational problem in this case is to minimize the regularized cost function

$$C(w, b; \mu) = \frac{1}{2} \|y(w, b) - z\|^2 + \mu \|L\| = \frac{1}{2} \sum_{k=1}^n (\sigma(w x_k + b) - z_k)^2 + \mu \|L\|.$$


13.6.4 Model averaging 

A reliable technique for reducing the test error is to average the outputs of N different neural networks with the same input, which are trained separately. Each particular network makes an error $\epsilon_i$. We shall assume the errors are independent random variables with zero mean and variance $v$, having the same distribution. By virtue of the Central Limit Theorem (Theorem D.6.4 of the Appendix), the average error, $\epsilon_{ave} = \frac{1}{N} \sum_{i=1}^N \epsilon_i$, tends to be distributed normally, with zero mean and variance $v/N$, for N large enough. This implies that the method of averaging performs better than each of its members.
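The variance reduction can be illustrated by a small simulation. The sketch below assumes, for convenience only, that the individual errors are Gaussian with variance $v$; the function name `averaged_error_variance` and its parameters are illustrative.

```python
import random
import statistics

def averaged_error_variance(n_models, v=1.0, trials=20000, seed=0):
    """Estimate the variance of the average of n_models independent errors."""
    rng = random.Random(seed)
    sd = v ** 0.5
    averages = [
        sum(rng.gauss(0.0, sd) for _ in range(n_models)) / n_models
        for _ in range(trials)
    ]
    return statistics.pvariance(averages)
```

For instance, `averaged_error_variance(10)` should be close to $v/N = 0.1$, while `averaged_error_variance(1)` should be close to $v = 1$.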

This idea can be also formalized in the context of output manifolds. The 
main idea is to project the target z onto several output manifolds associated 
with some neural nets, and then consider the average of the projections as 
an approximation of the target z. For the sake of simplicity, we shall discuss 
this technique only for the case of two networks. 

Consider two feedforward neural networks of the same depth, having the same input x and learning the same target $z \in \mathbb{R}^n$. Let $y = y(w, b)$ and $\tilde{y} = \tilde{y}(\tilde{w}, \tilde{b})$ be the outcomes of the two nets, see Fig. 13.16. They are




















Figure 13.16: Two neural nets with the same input, x, and outputs $y(w, b)$, $\tilde{y}(\tilde{w}, \tilde{b})$, learning the same target z.


regarded as points belonging to an output manifold each, $y \in S$ and $\tilde{y} \in \tilde{S}$. After training, the weights and biases are set equal to

$$(w^*, b^*) = \arg\min_{w, b} \|z - y(w, b)\|$$

$$(\tilde{w}^*, \tilde{b}^*) = \arg\min_{\tilde{w}, \tilde{b}} \|z - \tilde{y}(\tilde{w}, \tilde{b})\|.$$

This means that $y$ and $\tilde{y}$ are the orthogonal projections of the target z onto the manifolds $S$ and $\tilde{S}$, respectively. The outputs average, $\frac{1}{2}(y + \tilde{y})$, is an approximation of the target z, hopefully better than each of $y$ and $\tilde{y}$.

However, we can do better than this by employing a convex combination. There are points on the line segment $\{\lambda y + (1 - \lambda)\tilde{y};\ \lambda \in [0, 1]\}$ that are closer to z than both $y$ and $\tilde{y}$, see Fig. 13.17. The closest point is the projection of z onto this line. This corresponds to a better approximator of the target, which can be obtained as the output of only one network. This is the model combination of the previous two neural networks into only one net with the following properties, see Fig. 13.18:

(i) the input is x; 

(ii) its depth is one unit more than the given nets; 

(iii) its $\ell$th layer is the union of the $\ell$th layers of the given nets;

(iv) the last layer contains only one linear neuron;

(v) its parameters are given by $\{w, b, \tilde{w}, \tilde{b}, \lambda, 1 - \lambda\}$;

(vi) its outcome is $\lambda y + (1 - \lambda)\tilde{y}$.

The model combination can be applied to any number of neural nets. The resulting net has the output given by a convex combination of the individual network outcomes, with the coefficients chosen such that the distance between the target z and the affine space determined by the outcomes is minimum.
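For two networks the optimal coefficient can be written in closed form: minimizing $\|z - \lambda y - (1 - \lambda)\tilde{y}\|^2$ over $\lambda$ gives $\lambda = \langle z - \tilde{y}, y - \tilde{y} \rangle / \|y - \tilde{y}\|^2$, which is then restricted to $[0, 1]$. The following sketch illustrates this, assuming the outputs are given as lists of numbers; the function name `best_convex_combination` is illustrative.

```python
def best_convex_combination(z, y, y_tilde):
    """Return (lam, point): the closest point to z on the segment [y_tilde, y]."""
    d = [a - b for a, b in zip(y, y_tilde)]        # y - y_tilde
    r = [a - b for a, b in zip(z, y_tilde)]        # z - y_tilde
    dd = sum(a * a for a in d)
    if dd == 0.0:                                  # the two outputs coincide
        return 1.0, list(y)
    lam = sum(a * b for a, b in zip(r, d)) / dd    # projection onto the line
    lam = min(1.0, max(0.0, lam))                  # stay on the segment
    point = [lam * a + (1.0 - lam) * b for a, b in zip(y, y_tilde)]
    return lam, point
```

With `z = [1.5, 1.5]`, `y = [2.0, 0.0]` and `y_tilde = [0.0, 2.0]`, the combination with $\lambda = 0.5$ is the point `[1.0, 1.0]`, which is closer to z than either of the two outputs.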


13.6.5 Dropout 

Dropout is a powerful method to reduce overfitting, which works well for a large family of neural nets. The main idea is to drop or remove temporarily










Figure 13.17: The orthogonal projection of z on the line segment $y\tilde{y}$ is a better approximator than both $y$ and $\tilde{y}$. This is given by $\lambda^* y + (1 - \lambda^*)\tilde{y}$, where $\lambda^*$ is obtained as $\lambda^* = \arg\min_{\lambda} \|z - \lambda y - (1 - \lambda)\tilde{y}\|$.



Figure 13.18: The model combination of two nets is a net that produces a 
better learning than both of its parts. 
















neurons (hidden, input, but not output) from the network. The choice of which neurons are removed is random. The success of this method consists in breaking the coadaptations$^4$ formed among neurons by the standard backpropagation algorithm. Dropout trains each neuron to be able to act without the help of other neurons. Consequently, the resulting network will generalize well to new unseen data, and hence, it will produce a smaller test error.

Sometimes, dropout is explained more vividly by comparison with a company that adopts a training policy by which a certain percent of its workers are given a day off, while the rest of the workers are trained to perform the job of the missing ones. The workers picked to have the day off are randomly selected, while the percent is kept the same. At the end of this training period each worker knows the job of the others and, consequently, the workers are able to perform more efficiently when the company faces a new, unseen task.

The dropout technique also resembles $L^2$-regularization in the following sense. Since the dropout idea is to train neurons to act as independently as possible, this implies an indifference of the network to any specific feature. Since features are stored in the weights, it follows that the system of weights has to be shrunk enough, a fact similar to the case of a small norm.

Dropping random neurons from a net, including their incoming and outgoing connections, is equivalent to sampling a subnetwork, which is associated with an output submanifold. Training this subnetwork is equivalent to finding the projection of the target z onto the associated output submanifold. Applying this process for several subnetworks produces estimations of the target by projections onto the associated submanifolds. Their average is taken as an estimator for the target.

Dropping a hidden neuron Consider a neuron in the $\ell$th layer of a feedforward neural network, with $\ell \notin \{0, L\}$, i.e., the neuron belongs to a hidden layer. By dropping this neuron from the network, we understand removing the neuron together with all its weights (to and from the neuron), including also its bias. This will lead to a new neural network with the same input as the former one.

The dimension of the parameter space of the new network is $d^{(\ell-1)} + d^{(\ell+1)} + 1$ less than the dimension of the parameter space of the former net.


4 This can be easily understood, for instance, if you try to recite the alphabet in the 
reverse order. The brain builds coadaptations when learning the alphabet in chronolog- 
ical order from A to Z. The difficulty faced when trying to recite the alphabet in the 
reverse order shows the existence of certain coadaptations formed among neurons during 
the learning process. 



This follows from the fact that we have removed $d^{(\ell-1)}$ incoming weights, $d^{(\ell+1)}$ outgoing weights, and 1 bias. As usual, $d^{(\ell)}$ denotes the number of neurons in the $\ell$th layer. Therefore, by dropping a neuron the network's output depends on fewer parameters, which decreases the network capacity and reduces any eventual overfit.

After training, the network output becomes the projection of the target z on an output manifold of a smaller dimension. It is not clear whether this projection is closer to z than the former approximation applied before the neuron dropout. It is also not obvious which neuron dropout produces the best approximation.

Dropping several neurons The dropout technique removes randomly a certain percentage of neurons from each layer and then takes the average of the resulting outputs, see Fig. 13.19. However, dropping too many neurons will decrease the dimension of the parameter space too much and, consequently, it will lead to an underfit.

When a certain number of neurons are dropped, the resulting associated output manifold is a submanifold of the output manifold associated with the initial network. The codimension$^5$ of this submanifold is given by

$$k = \sum_{\ell=1}^{L-1} n^{(\ell)} \left( d^{(\ell-1)} + d^{(\ell+1)} + 1 \right),$$

where $n^{(\ell)}$ is the number of neurons dropped from the $\ell$th layer. If the same percent, $q$, is dropped from each layer, then $n^{(\ell)} = q\, d^{(\ell)} / 100$.
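The parameter count behind this formula can be checked on a small example. The sketch below assumes a fully connected feedforward network described by its list of layer sizes $d^{(0)}, \ldots, d^{(L)}$ and neurons dropped from a single hidden layer; the function names and the dictionary `dropped` (layer index to number of dropped neurons) are illustrative.

```python
def parameter_count(sizes):
    """Number of weights and biases for layer sizes d(0), ..., d(L)."""
    return sum(sizes[l - 1] * sizes[l] + sizes[l] for l in range(1, len(sizes)))

def codimension(sizes, dropped):
    """Sum over hidden layers of n(l) * (d(l-1) + d(l+1) + 1), as in the text."""
    return sum(
        dropped.get(l, 0) * (sizes[l - 1] + sizes[l + 1] + 1)
        for l in range(1, len(sizes) - 1)
    )

sizes = [3, 5, 4, 2]
after_drop = [3, 3, 4, 2]            # two neurons removed from the first hidden layer
direct = parameter_count(sizes) - parameter_count(after_drop)
formula = codimension(sizes, {1: 2})
```

Here the direct difference of parameter counts and the value given by the formula coincide, each dropped neuron accounting for 3 incoming weights, 4 outgoing weights and 1 bias.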

After each dropout, the trained network output represents the projection of the target z onto the associated output submanifold. Each of these projections is an approximation of the target z. By an approach similar to the Monte Carlo method, the average of all these projections represents an approximation of z, which diminishes the overfit and is less prone to bias than any of the $y_j$.

Example 13.6.3 We shall exemplify the method using a neural network with one hidden layer, one-dimensional input and output, see Fig. 13.19. We consider N neurons in the hidden layer, and drop uniformly one neuron at a time, obtaining the following outputs

$$y_j = y - \lambda_j \sigma(w_j x + b_j), \qquad j = 1, \ldots, N,$$


5 The codimension of a submanifold $S$ of a manifold $\mathcal{M}$ is the difference of their dimensions, $k = \dim \mathcal{M} - \dim S$.



Figure 13.19: When one neuron is dropped at a time, the network output produces projections of the target z onto the associated output manifolds. The average of projections, $\frac{1}{3}(y_1 + y_2 + y_3)$, is supposed to be a better approximation of z than any of the $y_j$.

where $y = \sum_{j=1}^N \lambda_j \sigma(w_j x + b_j)$ denotes the initial network output. Since each $y_j$ is taken with probability $q_j = q = \frac{1}{N}$, then the expected network output is given by the average of the outputs as














$$\sum_{j=1}^N q_j y_j = \frac{1}{N} \sum_{j=1}^N \left( y - \lambda_j \sigma(w_j x + b_j) \right) = y - \frac{1}{N}\, y = (1 - q)\, y,$$


which is proportional to the output of the initial network, y. 
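The proportionality factor can be verified numerically. The following sketch assumes a one-hidden-layer network with sigmoid activation and a linear output with coefficients $\lambda_j$; the function names and the sample parameter values are illustrative.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def network_output(x, lam, w, b, skip=None):
    """sum_j lam_j * sigma(w_j x + b_j), optionally without hidden neuron `skip`."""
    return sum(
        lam[j] * sigmoid(w[j] * x + b[j])
        for j in range(len(lam)) if j != skip
    )

def expected_dropout_output(x, lam, w, b):
    """Average of the N outputs y_j obtained by dropping one neuron at a time."""
    n = len(lam)
    return sum(network_output(x, lam, w, b, skip=j) for j in range(n)) / n

lam, w, b = [0.5, -1.0, 2.0], [1.0, 0.3, -0.7], [0.0, 0.2, -0.1]
y = network_output(0.4, lam, w, b)
avg = expected_dropout_output(0.4, lam, w, b)
```

With N = 3 hidden neurons, `avg` agrees with $(1 - \frac{1}{3})\, y$ up to rounding errors.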


Multiplicative noise Dropout can also be seen as adding multiplicative noise to the network. Since each neuron is retained with the probability p, this means that the neuron's output remains unchanged with probability p and vanishes with probability $q = 1 - p$ (i.e., it is dropped with probability q). This is equivalent to a multiplication with a Bernoulli random variable. If the ith output of the $\ell$th layer before dropout is $x_i^{(\ell)}$, then after dropout it becomes $\tilde{x}_i^{(\ell)} = R_i^{(\ell)} x_i^{(\ell)}$, with $R_i^{(\ell)} \sim \text{Bernoulli}(p)$, where the definition of the Bernoulli random variable can be found in the Appendix, section D.2. Therefore, the feedforward operation described by the master equation (6.2.24)

$$x_j^{(\ell)} = \phi\Big( \sum_{i=1}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)} \Big), \qquad 1 \le j \le d^{(\ell)},$$

in the case of dropout becomes

$$x_j^{(\ell)} = \phi\Big( \sum_{i=1}^{d^{(\ell-1)}} w_{ij}^{(\ell)} \tilde{x}_i^{(\ell-1)} - b_j^{(\ell)} \Big), \qquad 1 \le j \le d^{(\ell)},$$

where $\tilde{x}_i^{(\ell-1)} = R_i^{(\ell-1)} x_i^{(\ell-1)}$. In the equivalent matrix form, the equation (6.2.29)

$$X^{(\ell)} = \phi\big( W^{(\ell)T} X^{(\ell-1)} - B^{(\ell)} \big)$$

becomes

$$X^{(\ell)} = \phi\big( W^{(\ell)T} (R^{(\ell-1)} \odot X^{(\ell-1)}) - B^{(\ell)} \big),$$

where $R^{(\ell-1)} = (R_i^{(\ell-1)})$ is a vector of independent Bernoulli random variables and $\odot$ denotes the Hadamard product of vectors.
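One feedforward step with a Bernoulli mask can be sketched as below. This is a minimal illustration, assuming the weights are stored as a nested list with `weights[i][j]` standing for $w_{ij}^{(\ell)}$, the bias is subtracted as in the master equation, and the activation defaults to `math.tanh`; the function name `dropout_layer` is hypothetical.

```python
import math
import random

def dropout_layer(x_prev, weights, biases, p, rng, phi=math.tanh):
    """Compute phi(W^T (R * x_prev) - B) with independent R_i ~ Bernoulli(p)."""
    mask = [1.0 if rng.random() < p else 0.0 for _ in x_prev]   # R^(l-1)
    x_tilde = [r * x for r, x in zip(mask, x_prev)]             # Hadamard product
    out = []
    for j in range(len(biases)):
        s = sum(weights[i][j] * x_tilde[i] for i in range(len(x_prev)))
        out.append(phi(s - biases[j]))
    return out, mask
```

For p = 1 every neuron is retained and the usual feedforward step is recovered, while for p = 0 all inputs are dropped and the layer output reduces to $\phi(-b_j^{(\ell)})$.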


Remark 13.6.4 Empirical evidence has shown that the optimal retention rate for hidden layers is usually p = 0.5, while for the input layer it is about p = 0.8.




The next section establishes a relation between dropout and $L^2$-regularization.

Linear regression with dropout This section deals with the application of dropout in the case of the classical problem of linear regression. Consider the input vector $X \in \mathbb{R}^n$ and the target $z \in \mathbb{R}$. We need to learn the weights vector $w \in \mathbb{R}^n$ such that $\|z - Xw\|^2$ is minimized. Applying dropout, the new objective function becomes

$$f(w) = E\big[ \|z - R \odot Xw\|^2 \big],$$

where $R = (R_1, \ldots, R_n)$ is a vector of independent Bernoulli random variables, $R_j \sim \text{Bernoulli}(p)$. Using that $\|a - b\|^2 = \|a\|^2 - 2a^T b + \|b\|^2$ and the linearity of the expectation operator, the objective function becomes


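$$\begin{aligned}
f(w) &= \|z\|^2 - 2 z^T E[R] \odot Xw + E\big[ \|R \odot Xw\|^2 \big] \\
&= \|z\|^2 - 2p\, z^T Xw + p^2 \|Xw\|^2 + E\big[ \|R \odot Xw\|^2 \big] - p^2 \|Xw\|^2 \\
&= \|z - pXw\|^2 + E\big[ \|R \odot Xw\|^2 \big] - p^2 \|Xw\|^2 \\
&= \|z - pXw\|^2 + E\big[ w^T (R \odot X)^T (R \odot X) w \big] - p^2 \|Xw\|^2 \\
&= \|z - pXw\|^2 + E[R^2]\, w^T X^T X w - p^2 \|Xw\|^2 \\
&= \|z - pXw\|^2 + p\, w^T X^T X w - p^2 \|Xw\|^2 \\
&= \|z - pXw\|^2 + p \|Xw\|^2 - p^2 \|Xw\|^2 \\
&= \|z - pXw\|^2 + p(1 - p) \|Xw\|^2,
\end{aligned}$$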






where we have added and subtracted the term $p^2 \|Xw\|^2$ to form a square of the norm and used that the second moment of a Bernoulli variable is p. Absorbing the factor p into the weight w, the objective function becomes

$$f(\tilde{w}) = \|z - X\tilde{w}\|^2 + \lambda \|X\tilde{w}\|^2,$$

which is an $L^2$-regularization problem with the Lagrange multiplier $\lambda = \frac{1-p}{p}$ and $\tilde{w} = pw$. When p tends to 1, all neurons are retained and $\lambda$ gets small. The constant $\lambda$ represents the ratio between the non-retained and the retained neurons during the dropout process. Hence, a linear regression with dropout is equivalent to an $L^2$-regularization problem.
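The identity $E\|z - R \odot u\|^2 = \|z - pu\|^2 + p(1-p)\|u\|^2$ can be checked exactly on a small example by enumerating all Bernoulli masks. The sketch below makes an assumption that should be stated loudly: the mask R is applied componentwise to a vector $u$ standing for $Xw$, and $z$ is a vector of the same length. The function names are illustrative.

```python
from itertools import product

def expected_dropout_loss(z, u, p):
    """Exact E||z - R * u||^2 over all masks R with independent Bernoulli(p) entries."""
    total = 0.0
    for mask in product((0, 1), repeat=len(u)):
        prob = 1.0
        for r in mask:
            prob *= p if r == 1 else 1.0 - p
        total += prob * sum((zk - r * uk) ** 2 for zk, r, uk in zip(z, mask, u))
    return total

def regularized_loss(z, u, p):
    """Closed form ||z - p u||^2 + p (1 - p) ||u||^2."""
    fit = sum((zk - p * uk) ** 2 for zk, uk in zip(z, u))
    penalty = p * (1.0 - p) * sum(uk ** 2 for uk in u)
    return fit + penalty
```

Under this assumption the two functions return the same value up to rounding errors, for any retention rate p between 0 and 1.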

Gaussian noise The idea of introducing noise into a neural network in order to reduce overfitting works also for other types of noise. Srivastava et al. [116] describe a method of adding Gaussian noise to each neuron proportional to its activation. This means that the output $x_i^{(\ell)}$ of a hidden neuron is perturbed by a Gaussian noise proportional to the activation, i.e., to $x_i^{(\ell)} + x_i^{(\ell)} G$, with $G \sim \mathcal{N}(0, 1)$. The perturbation can be written equivalently in a multiplicative way as $x_i^{(\ell)} G'$, with $G' \sim \mathcal{N}(1, 1)$. It is worth noting that this new type of dropout works at least as well as the regular dropout involving Bernoulli random variables.






























































In conclusion, removing neurons from each hidden layer and input layer of a neural network reduces substantially the dimension of the associated output manifold, leading to a decrease in the network capacity, and hence to a reduction of any overfitting effect. The reader can find more details in the paper [116].

Regularization by inserting noise One way to prevent neural networks from overfitting training data is to insert noise in the network during training and then average over the noise during testing. The noise can be, for instance, multiplicative or additive. In the case of a multiplicative noise, we multiply the outputs of each layer by a random variable R (Bernoulli, uniform, Gaussian, etc.). The network output, which depends now on both the input, X, and noise, R, is given by $Y = f_w(X, R)$, and becomes a random variable. At training time we find the optimal values of w for outputs of this noisy type. The optimal value depends on R as $w^* = w^*(R)$. To remove the randomness we need to average over the random variable R at the testing time as

$$y = f(x) = E_R\big[ f_{w^*(R)}(x, R) \big] = \int f_{w^*(r)}(x, r)\, p(r)\, dr, \tag{13.6.16}$$


where p(r) is the probability density of the random variable R. 

Formula (13.6.16) has more theoretical value than practical, as the integral on the right side is difficult to compute exactly. In practice, we train the network for N instances of the random variable R, given by $r_1, \ldots, r_N$, and obtain the optimal values of the weights as $w_1^*, \ldots, w_N^*$. This means

$$w_i^* = \arg\min_w \|z - f_w(x, r_i)\|,$$

where z = z(x) is the target function that needs to be learned by the network. At the testing time we consider the average of all N outputs, evaluating the expectation (13.6.16) by the following Monte Carlo formula:

$$f(x) = \frac{1}{N} \sum_{i=1}^N f_{w_i^*}(x, r_i). \tag{13.6.17}$$

It is worth noting that the classical dropout technique is obtained as a particular case of (13.6.17) by considering R to be a Bernoulli random variable. This means that R takes the value 1 with probability p and the value 0 with probability 1 - p. A neuron activation multiplied by R = 0 is equivalent to a dropped neuron, while an activation multiplied by the value R = 1 is a retained neuron. Therefore, 100(1 - p) percent of neurons in each layer are randomly dropped, while 100p percent of neurons are retained.

Since multiplying a neuron activation by 0 is the same as assuming all the 
weights (ingoing to and outgoing from the neuron) vanishing, then (13.6.17) 
represents an average of outcomes of N trained subnetworks. 
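The Monte Carlo average can be sketched generically as follows. This is an illustration only: the helper `monte_carlo_prediction` and the toy setting (a scalar model $f_w(x, r) = w\, r\, x$ with uniform multiplicative noise, trained by least squares on the target $z = 2x$) are assumptions made for the example and are not taken from the text.

```python
import random

def monte_carlo_prediction(x, noise_samples, train, predict):
    """Average of the outputs of the models trained for each noise instance."""
    outputs = [predict(train(r), x, r) for r in noise_samples]
    return sum(outputs) / len(outputs)

# Hypothetical toy data: pairs (x_k, z_k) with z_k = 2 * x_k.
DATA = [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]

def train(r):
    """Least-squares weight of the model w * r * x for a fixed noise value r != 0."""
    num = sum(z * r * x for x, z in DATA)
    den = sum((r * x) ** 2 for x, _ in DATA)
    return num / den

def predict(w, x, r):
    return w * r * x

rng = random.Random(0)
noise = [rng.uniform(0.5, 1.5) for _ in range(100)]
estimate = monte_carlo_prediction(3.0, noise, train, predict)
```

In this toy case each trained weight equals $2/r$, so every model reproduces the target and the average at x = 3 is 6 up to rounding errors. A Bernoulli noise would require handling the degenerate case r = 0 in `train`.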







13.7 Summary 

This chapter discusses neural networks from a geometric point of view. An output manifold is associated with each network. The local coordinates on the manifold are the weights and biases of the network. The output manifold concept is useful for understanding several aspects such as optimal weights, learning process, overfitting and underfitting, as well as regularization techniques.

The optimal weights and biases of a network correspond to the coordinates of the orthogonal projection of the target onto the output manifold. Each learning algorithm changes coordinates on the manifold and corresponds to a curve on it. Endowing the manifold with a Riemannian metric enables the computation of curve lengths and also defining the geodesic, which is the shortest curve between two points. The geodesic between the initial point and the projection point of the target onto the manifold corresponds to the most efficient learning algorithm.

A target point which is too distant from the output manifold indicates an underfit, while a target point which is too close, or on the manifold, represents an overfit.

Different types of regularization methods can be treated in terms of output manifolds. Going for a smaller dimension of the output manifold means decreasing the number of weights and consequently having fewer neural units in the network, which leads to a decrease in the network capacity.

Choosing the flattest output manifold produces the least overfitting to training data. Model averaging chooses by means of minimizing distances a model with a better fit than any of its component networks. The dropout technique falls into this class and can also be seen as a multiplicative noise regularization method. A relation between dropout and $L^2$-regularization is discussed.

13.8 Exercises 

Exercise 13.8.1 A feedforward neural network of type 784-200-100-50-10 is 
used to classify the MNIST data. Find the dimension of the associated output 
manifold. (784 is the input size and 10 represents the number of digit classes). 

Exercise 13.8.2 Consider a neural network with input and output sizes given by $d^{(0)}$ and $d^{(L)}$, respectively. The number of hidden neurons is denoted by N. We assume there is an equal number of neurons in each hidden layer. Show that the number of hidden layers for which the output manifold has a maximum dimension is

$$k = \frac{2N}{d^{(0)} + d^{(L)} + N}.$$





Exercise 13.8.3 A one-hidden layer feedforward neural net, 784-N-10, is used to classify the MNIST data. Find the range of the number of hidden neurons, N, for which the network overfits the training data.

Exercise 13.8.4 A two-hidden layer feedforward neural net, 784-h-h-10, is used to classify the MNIST data. Find the range of the number h, for which the network overfits the training data.

Exercise 13.8.5 Let $u, v \in T_y S$ be two tangent vectors. Show that $v$ and $u$ are orthogonal in $\mathbb{R}^n$ if and only if $g(u, v) = \sum_{i,j} u_i v_j g_{ij} = 0$.

Exercise 13.8.6 A subset $\mathcal{A}$ of $\mathbb{R}^n$ is called an affine subspace if for any two points $A, B \in \mathcal{A}$ we have $\lambda A + (1 - \lambda) B \in \mathcal{A}$, $\forall \lambda \in \mathbb{R}$. Let L be the second fundamental form of S with respect to $\mathbb{R}^n$. Show that the following are equivalent:

(a) L = 0;

(b) Any geodesic in S is a straight line in $\mathbb{R}^n$;

(c) S is an affine subspace of $\mathbb{R}^n$.

Exercise 13.8.7 Let S be a submanifold of the manifold $\mathcal{M}$. Show that the following are equivalent:

(a) The second fundamental form of S with respect to $\mathcal{M}$ is zero, L = 0;

(b) Any curve, which is a geodesic in S, is also a geodesic in $\mathcal{M}$.

Exercise 13.8.8 Let $c(s)$ be a curve on the output manifold S, $s \in [a, b]$. Its length and energy are defined, respectively, by

$$L(c) = \int_a^b \|\dot{c}(s)\|\, ds, \qquad \mathcal{E}(c) = \frac{1}{2} \int_a^b \|\dot{c}(s)\|^2\, ds,$$

where $\|\dot{c}\|$ represents the length of the velocity along the curve in the metric structure of S.

(a) Show that the length and energy of a curve are invariant under curve parametrizations. That is, if $\phi : [c, d] \to [a, b]$ is a strictly increasing function, then the curve $\gamma(t) = c(\phi(t))$ and $c(s)$ have the same length and energy.

(b) Show that $L(c)^2 \le 2(b - a)\, \mathcal{E}(c)$. When is the identity reached?

(c) Let $c_u(s)$, $|u| < \epsilon$, be a smooth variation of $c(s)$, with $c_0(s) = c(s)$. It can be shown that both variational equations

$$\frac{d}{du} L(c_u) \Big|_{u=0} = 0, \qquad \frac{d}{du} \mathcal{E}(c_u) \Big|_{u=0} = 0$$

can be written as

$$\ddot{c}^k(s) + \sum_{i,j} \Gamma_{ij}^k(c(s))\, \dot{c}^i(s)\, \dot{c}^j(s) = 0, \qquad 1 \le k \le n,$$

where $c(s) = (c^1(s), \ldots, c^n(s))$. Furthermore, the previous equation represents the zero acceleration equation along the submanifold S and can also be written as $\nabla_{\dot{c}} \dot{c} = 0$. What is the significance of these facts?

Exercise 13.8.9 (a) Find the embedding curvature of the 2-dimensional unit sphere, $S^2$, in the 3-dimensional Euclidean space, $\mathbb{R}^3$.

(b) Use part (a) to find the norm $\|L\|$. Experiment with different sphere parametrizations. What do you notice?

Exercise 13.8.10 Consider the model combination of two sigmoid neurons. 
Write the output of the combination and specify the dimension of the asso- 
ciated output manifold. 

Exercise 13.8.11 List a few effects of dropping neurons from a network on 
the associated output manifold. 

Exercise 13.8.12 For any two vector fields in $\mathbb{R}^n$

$$U = \sum_k U^k e_k, \qquad V = \sum_k V^k e_k,$$

define $\nabla_U V = \sum_k U(V^k)\, e_k$. Let $f$ be a smooth function on $\mathbb{R}^n$. Show the following relations:

(a) $\nabla_{fU} V = f \nabla_U V$;

(b) $\nabla_U (fV) = U(f) V + f \nabla_U V$;

(c) VuV = V v U- 

(d) $U \langle V, W \rangle = \langle \nabla_U V, W \rangle + \langle V, \nabla_U W \rangle$, where W is any other vector field.





Chapter 14 

Neuromanifolds


In this chapter we shall approach the study of neural networks from the Information Geometry perspective. This applies both techniques of Differential Geometry and Probability Theory to neural networks.

The difference from the theory introduced in Chapter 13 is that here the network's input and target are probability densities of random variables and the neural network output contains some noisy perturbation. This way, the family of joint probability densities of the input and output, $p(x, y; \theta)$, becomes a statistical manifold, which is parametrized by $\theta$; thus, the weights and biases play the role of a coordinate system for the associated statistical manifold. The intrinsic distance between two neural networks is measured in this space using the Fisher information metric. Roughly speaking, the Fisher metric represents the amount of information about the network's own weights and biases that is contained in the training distribution. The associated statistical manifold endowed with the Fisher metric becomes a Riemannian manifold, called a neuromanifold.

In this chapter we compute explicitly the Fisher metric for several simple types of networks and present the natural gradient learning algorithm. The understanding of the Fisher metric leads to the characterization of shortest curves in the parameter space - the geodesics. This is important since each motion in the neural manifold corresponds to a learning process. The natural gradient is defined as the gradient computed with respect to the Riemannian metric induced by the Fisher information.

The natural gradient descent method is presented, as a better alternative to the usual gradient descent, which converges faster to the minimum

© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_14 




of the cost function. The relation between the distance, curve length and energy in the parameter space, and the Kullback-Leibler divergence on the neuromanifold is also established.

14.1 Statistical Manifolds 

First we shall recall some terminology from Information Geometry regarding statistical manifolds from the neural networks point of view. Consider the input to a neural network given by the random variable X. The output variable is $Y = f_\theta(X)$, where $f_\theta$ is the input-output mapping and $\theta = (w, b)$ are the network parameters. The input and output distributions are denoted by $p_X(x)$ and $p_Y(y; \theta)$, respectively. The joint input-output distribution is indicated by $p(x, y; \theta)$. Here, $\theta$ has been included to suggest the density dependence on network parameters.

The target is given by the random variable Z. The joint distribution of (X, Z) is the training distribution, $p(x, z)$. Since the target is not replicated perfectly by the network output, we have $Z = Y + \epsilon(\theta)$, where $\epsilon(\theta)$ is an error term that depends on the network parameters. In the case when the mean square cost function is used as a loss function, we have to minimize the second moment of the error

$$C(\theta) = \frac{1}{2} E[(Z - Y)^2] = \frac{1}{2} E[\epsilon(\theta)^2].$$

Noisy neurons Information geometry is applied to neural networks in the context of noisy neurons. This means that in order to offset as much as possible of the error $\epsilon(\theta)$ we need to insert some noise in the network. The noise idea is not new. We have seen it in section 12.8.2 and in section 13.6.5, where adding multiplicative noise to the network leads to the dropout technique. Here, we consider an additive noise term, n, so the output of the network is given now by

$$Y = f_\theta(X) + n. \tag{14.1.1}$$

It is worth noting the role of the inserted noise played in regularization. 

Assuming the output is one-dimensional, we shall consider two cases of noise terms:

1. One possibility is to assume the noise to be a standard normal random variable, $n \sim \mathcal{N}(0, 1)$. Then, the conditional probability of the output, Y, given the input, X, is given in this case by

$$p(y | x; \theta) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y - f_\theta(x))^2}. \tag{14.1.2}$$

This is based on the fact that $E[Y | X = x] = f_\theta(x)$, and $Var(Y | X = x) = 1$.





Figure 14.1: A statistical manifold $S = \{p(x, y; \theta);\ \theta \in \Theta\}$.


2. Another possibility is to consider a uniform random noise between -1 and 1, $n \sim \text{Unif}[-1, 1]$. In this case the conditional probability is given by

$$p(y | x; \theta) = \begin{cases} \frac{1}{2}, & \text{if } f_\theta(x) - 1 \le y \le f_\theta(x) + 1 \\ 0, & \text{otherwise.} \end{cases}$$

The joint distribution of (X, Y) can be found now using the conditional probability formula

$$p(x, y; \theta) = p(x)\, p(y | x; \theta). \tag{14.1.3}$$
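The two noise models can be written down directly. In the sketch below the argument `fx` stands for the value $f_\theta(x)$ produced by the network and `px` for the input density evaluated at x; these names, as well as the function names, are illustrative.

```python
import math

def gaussian_noise_density(y, fx):
    """p(y | x; theta) when Y = f_theta(x) + n with n ~ N(0, 1)."""
    return math.exp(-0.5 * (y - fx) ** 2) / math.sqrt(2.0 * math.pi)

def uniform_noise_density(y, fx):
    """p(y | x; theta) when n ~ Unif[-1, 1]."""
    return 0.5 if fx - 1.0 <= y <= fx + 1.0 else 0.0

def joint_density(px, conditional, y, fx):
    """p(x, y; theta) = p(x) * p(y | x; theta)."""
    return px * conditional(y, fx)
```

Both conditional densities integrate to one in the variable y, and both are centered at the network output $f_\theta(x)$.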

The goal is to tune the parameter $\theta$ such that $p(x, y; \theta)$ matches as much as possible the true distribution $p(x, z)$. This can be achieved geometrically as in the following.

Statistical manifold The family of density functions, $\{\theta \to p(x, y; \theta);\ \theta \in \Theta\}$, parametrized over $\Theta$, can be considered as a submanifold of the infinite-dimensional space of probability density functions. Here, $\Theta$ denotes the parameter space. The following regularity condition is also assumed: the functions

$$\frac{\partial}{\partial \theta_1} p(x, y; \theta), \ldots, \frac{\partial}{\partial \theta_N} p(x, y; \theta) \tag{14.1.4}$$

are linearly independent, where $\theta^T = (\theta_1, \ldots, \theta_N) \in \Theta$. This condition assures that the submanifold is smooth and admits a tangent space at each point $p(x, y; \theta)$. The manifold $S = \{p(x, y; \theta);\ \theta \in \Theta\}$ is called a statistical manifold, see Fig. 14.1.

In this setup the training density, $p(x, z)$, represents a point in the space of probability densities, which, in general, is not situated on the manifold S, see Fig. 14.2. By tuning the network parameter vector, $\theta$, we try to minimize the distance between the given density and the corresponding manifold. The optimal parameter, if it exists, is given by

$$\theta^* = \arg\min_\theta d(p(x, z), p(x, y; \theta)) = \arg\min_\theta D_{KL}(p(x, z) \| p(x, y; \theta)),$$
6 0 













where $D_{KL}$ denotes the Kullback-Leibler divergence. We note that any other cost function can be considered, but the Kullback-Leibler divergence is preferred due to its relation to the maximum likelihood estimation.

If the distance is zero, then the training distribution, $p(x, z)$, belongs to the statistical manifold, S, and the learning becomes exact, i.e., there is an exact value of the parameter $\theta$ such that $p(x, z) = p(x, y; \theta)$.

The log-likelihood function for the statistical manifold S is defined by

$$\ell(\theta) = \ell(x, y; \theta) = \ln p(x, y; \theta).$$

In practice, the aforementioned distance is measured as a training error over the training data $\{(x_1, z_1), \ldots, (x_n, z_n)\}$. This is given by the average of the negative log-likelihoods evaluated at $(x_k, z_k)$ as

$$C_{train}(\theta) = -\frac{1}{n} \sum_{k=1}^n \ell(x_k, z_k; \theta) = -\frac{1}{n} \sum_{k=1}^n \ln p(x_k, z_k; \theta).$$

The optimal parameter, $\theta^* = \arg\min_\theta C_{train}(\theta)$, is the maximum likelihood estimator, $\theta^* = \theta^{MSE}$. The relation between $C_{train}(\theta)$ and the Kullback-Leibler divergence between the training and the model distributions has been pointed out in section 3.6 and is mainly based on the following argument

$$\begin{aligned}
C_{train}(\theta) &= -E_{p_{XZ}}[\ell(X, Y; \theta)] \\
&= -E_{p_{XZ}}[\ln p(X, Y; \theta) - \ln p(X, Z) + \ln p(X, Z)] \\
&= E_{p_{XZ}}\Big[ \ln \frac{p(X, Z)}{p(X, Y; \theta)} \Big] - E_{p_{XZ}}[\ln p(X, Z)] \\
&= D_{KL}(p(X, Z) \| p(X, Y; \theta)) + H(p(X, Z)),
\end{aligned}$$

where $H(p(X, Z)) = -E_{p_{XZ}}[\ln p(X, Z)]$.


Since the Shannon entropy $H(p(X, Z))$ is independent of the parameter $\theta$, then the optimal parameter

$$\theta^* = \arg\min_\theta C_{train}(\theta) = \arg\min_\theta D_{KL}(p(X, Z) \| p(X, Y; \theta))$$

minimizes the Kullback-Leibler divergence between the training distribution, $p(X, Z)$, and the statistical manifold $S = \{p(X, Y; \theta);\ \theta \in \Theta\}$.
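The decomposition behind this argument can be illustrated on discrete distributions, where expectations are finite sums. The sketch below assumes the Shannon entropy is defined as $H(p) = -E_p[\ln p]$, so that the expected negative log-likelihood equals the Kullback-Leibler divergence plus the entropy; the function names and the sample distributions are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """-E_p[ln q]: the expected negative log-likelihood of the model q under p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]      # plays the role of the training distribution
q = [0.4, 0.4, 0.2]      # plays the role of a model distribution
gap = cross_entropy(p, q) - (kl_divergence(p, q) + entropy(p))
```

The value `gap` vanishes up to rounding errors. Since the entropy term does not depend on the model q, minimizing the expected negative log-likelihood over q is the same as minimizing the divergence.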

The aforementioned statistical manifold, $S = \{p(x, y; \theta);\ \theta \in \Theta\}$, can be endowed with a metric structure and will be regarded as a Riemannian manifold. We shall introduce this metric in the next section.

It is worth noting that formulas containing target values $z_k$ have an extrinsic character, since they access information regarding an exterior point, $p(x, z)$, of the statistical manifold. Formulas involving $x_k$ and $y_k = f_\theta(x_k)$ have an intrinsic character, and hence they belong to the local geometry of the statistical manifold S. From this point of view, the aforementioned training error, $C_{train}(\theta)$, is an extrinsic object. The next goal is to introduce the Fisher metric on a statistical manifold, which is an intrinsic object.



Figure 14.2: Geometric image of the goodness of fit (loss function). Labels in the figure: true distribution $p(x, z)$; optimal distribution $p(x, y; \theta^*)$; statistical manifold.

14.2 Fisher Information 

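Assume that a probability density $p(x; \theta)$ depends on the real-valued parameter $\theta$, and we would like to estimate this parameter by an unbiased estimator $\hat{\theta} = \hat{\theta}(x)$, which depends on data x. This means the average of the difference $\hat{\theta}(x) - \theta$ over all sets of data in the presence of parameter $\theta$ is zero

$$E[\hat{\theta}(x) - \theta] = \int (\hat{\theta}(x) - \theta)\, p(x; \theta)\, dx = 0.$$

Denote, for simplicity, $p = p(x; \theta)$, and differentiate the previous relation with respect to $\theta$ using the product rule

$$0 = \frac{\partial}{\partial \theta} E[\hat{\theta}(x) - \theta] = \int (\hat{\theta}(x) - \theta) \frac{\partial p}{\partial \theta}\, dx - \int p\, dx.$$

Since $\int p\, dx = 1$ and $\frac{\partial p}{\partial \theta} = \frac{\partial \ln p}{\partial \theta}\, p$, the previous relation implies

$$\int (\hat{\theta}(x) - \theta) \frac{\partial \ln p}{\partial \theta}\, p\, dx = 1.$$

A key feature is to split the density p into the product of two square roots and then rewrite the expression as

$$\int \Big( (\hat{\theta}(x) - \theta) \sqrt{p} \Big) \Big( \frac{\partial \ln p}{\partial \theta} \sqrt{p} \Big)\, dx = 1.$$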




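Taking the square and using Cauchy's integral inequality yields

\[
1 \le \int (\hat\theta(x) - \theta)^2\, p\, dx \int \Big(\frac{\partial \ln p}{\partial\theta}\Big)^2 p\, dx.
\]

Each integral on the right side has the significance of an expectation. The
first one,

\[
\mathbb{E}[(\hat\theta - \theta)^2] = \int (\hat\theta(x) - \theta)^2\, p\, dx,
\]

is the mean square error and the latter,

\[
I(\theta) = \mathbb{E}\Big[\Big(\frac{\partial \ln p}{\partial\theta}\Big)^2\Big] = \int \Big(\frac{\partial \ln p}{\partial\theta}\Big)^2 p\, dx,
\]

is the Fisher information with respect to $\theta$. The previous inequality can
now be written as

\[
\mathbb{E}[(\hat\theta - \theta)^2]\, I(\theta) \ge 1,
\]

or equivalently, as

\[
\mathbb{E}[(\hat\theta - \theta)^2] \ge \frac{1}{I(\theta)}. \tag{14.2.5}
\]

This is called the Cramér-Rao inequality. It states that the inverse of the
Fisher information is a lower bound for the mean square error. Therefore,
the minimum squared error estimator, $\hat\theta^{MSE}$, satisfies the identity

\[
\mathbb{E}[(\hat\theta^{MSE} - \theta)^2] = \frac{1}{I(\theta)}. \tag{14.2.6}
\]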


Hence, when the information content is high, the error is low and vice versa.
The Fisher information, $I(\theta)$, represents the derivative content of the
log-likelihood function $\ell(\theta) = \ln p(x;\theta)$. The faster $\ell(\theta)$ changes with respect to
$\theta$, the larger the information $I(\theta)$ and the smaller the mean squared error.
We shall see later that estimators that realize equality in the Cramér-Rao
inequality are called Fisher-efficient.

Fisher information is an assessment of the information about the unknown
parameter $\theta$ contained in the random variable $X$, which is modeled by the
family of densities $p(x;\theta)$. Hence, an unbiased estimator of $\theta$ has variance at least
as large as the inverse of this information, and it becomes efficient when its
variance is the lowest possible. If there is little information about $\theta$ contained
in $X$, then the Cramér-Rao inequality states that any estimation of $\theta$ is loose, in
the sense that it has a large variance.
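
As a rough numerical illustration of the Cramér-Rao inequality, the following sketch estimates the Fisher information of a Gaussian family by Monte Carlo. The family $p(x;\theta) = \mathcal{N}(\theta, \sigma^2)$, the estimator $\hat\theta(x) = x$, and the function name `fisher_information_mc` are assumptions chosen for illustration and are not taken from the text. For this family $I(\theta) = 1/\sigma^2$, and the estimator $\hat\theta(x) = x$ attains the bound.

```python
import numpy as np

def fisher_information_mc(theta, sigma, num_samples=200_000, seed=0):
    """Monte Carlo estimate of I(theta) = E[(d/dtheta ln p(x; theta))^2]
    for the (assumed) Gaussian family p(x; theta) = N(theta, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, sigma, size=num_samples)
    score = (x - theta) / sigma**2      # derivative of ln p(x; theta) in theta
    return np.mean(score**2), x

theta, sigma = 1.5, 2.0
info, x = fisher_information_mc(theta, sigma)
mse = np.mean((x - theta) ** 2)         # mean square error of theta_hat(x) = x

print("estimated I(theta):", info, " exact:", 1.0 / sigma**2)
print("MSE of theta_hat:  ", mse, " bound 1/I:", 1.0 / info)
```

In this example the product of the mean square error and the estimated information is close to 1, which is the equality case of (14.2.5).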


1 Sometimes, this is stated equivalently as $\mathrm{Var}(\hat\theta) \ge \dfrac{1}{I(\theta)}$.














We also note that the Fisher information can be written in terms of the
log-likelihood function as

\[
I(\theta) = \mathbb{E}\Big[\Big(\frac{\partial \ell(\theta)}{\partial\theta}\Big)^2\Big].
\]

This expression applies to a density function, $p(x;\theta)$, which depends
on only one parameter, $\theta$. The multidimensional case is treated in the following.

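The multivariate case. A similar concept can be introduced in the case
when the parameter $\theta$ is $n$-dimensional, $\theta = (\theta_1, \dots, \theta_n)$. In this case we
obtain the Fisher information matrix

\[
g_{ij}(\theta) = \mathbb{E}\Big[\frac{\partial \ell(\theta)}{\partial\theta_i}\frac{\partial \ell(\theta)}{\partial\theta_j}\Big], \tag{14.2.7}
\]

where $\ell(\theta) = \ln p(x;\theta)$, and the expectation is taken with respect to $p(x;\theta)$.
Other equivalent expressions are given in the following:

Proposition 14.2.1 The Fisher matrix can also be expressed as

\[
g_{ij}(\theta) = -\mathbb{E}\Big[\frac{\partial^2 \ell(\theta)}{\partial\theta_i\,\partial\theta_j}\Big], \tag{14.2.8}
\]

\[
g_{ij}(\theta) = 4\int \frac{\partial \sqrt{p(x;\theta)}}{\partial\theta_i}\frac{\partial \sqrt{p(x;\theta)}}{\partial\theta_j}\, dx. \tag{14.2.9}
\]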


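Proof: Taking the derivative in $\int p(x;\theta)\, dx = 1$ with respect to $\theta_i$ yields

\[
\int \partial_{\theta_i} p(x;\theta)\, dx = 0,
\]

which is equivalent to

\[
\int \partial_{\theta_i} \ln p(x;\theta)\; p(x;\theta)\, dx = 0.
\]

Differentiating one more time with respect to $\theta_j$ and applying the product rule,
we get

\[
\begin{aligned}
0 &= \int \partial_{\theta_j}\partial_{\theta_i} \ln p(x;\theta)\; p(x;\theta)\, dx + \int \partial_{\theta_i} \ln p(x;\theta)\; \partial_{\theta_j} p(x;\theta)\, dx \\
&= \mathbb{E}\Big[\frac{\partial^2 \ell(\theta)}{\partial\theta_i\,\partial\theta_j}\Big] + \int \partial_{\theta_i} \ln p(x;\theta)\; \partial_{\theta_j} \ln p(x;\theta)\; p(x;\theta)\, dx \\
&= \mathbb{E}\Big[\frac{\partial^2 \ell(\theta)}{\partial\theta_i\,\partial\theta_j}\Big] + g_{ij}(\theta),
\end{aligned}
\]

which implies relation (14.2.8).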

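In order to prove relation (14.2.9) we apply the following straightforward
computation:

\[
\begin{aligned}
g_{ij}(\theta) &= \int \partial_{\theta_i} \ln p(x;\theta)\; \partial_{\theta_j} \ln p(x;\theta)\; p(x;\theta)\, dx \\
&= \int \frac{\partial_{\theta_i} p(x;\theta)}{p(x;\theta)}\,\frac{\partial_{\theta_j} p(x;\theta)}{p(x;\theta)}\; p(x;\theta)\, dx \\
&= 4\int \frac{\partial_{\theta_i} p(x;\theta)}{2\sqrt{p(x;\theta)}}\,\frac{\partial_{\theta_j} p(x;\theta)}{2\sqrt{p(x;\theta)}}\, dx \\
&= 4\int \frac{\partial \sqrt{p(x;\theta)}}{\partial\theta_i}\frac{\partial \sqrt{p(x;\theta)}}{\partial\theta_j}\, dx.
\end{aligned}
\]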
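
The three expressions (14.2.7), (14.2.8) and (14.2.9) can be compared numerically on a one-parameter example. The sketch below is only an illustration under stated assumptions: the exponential family $p(x;\theta) = \theta e^{-\theta x}$ on $x \ge 0$ (for which the information equals $1/\theta^2$), the finite-difference step `h`, the integration grid, and the helper `trapezoid` are all choices made here, not part of the text.

```python
import numpy as np

def trapezoid(f, x):
    """Composite trapezoidal rule for samples f on the grid x."""
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

def p(x, theta):
    """Assumed example density: exponential law with rate theta."""
    return theta * np.exp(-theta * x)

theta, h = 2.0, 1e-3
x = np.linspace(0.0, 40.0, 400_001)     # the tail beyond x = 40 is negligible
px = p(x, theta)

# Central finite differences with respect to theta
dlog = (np.log(p(x, theta + h)) - np.log(p(x, theta - h))) / (2 * h)
d2log = (np.log(p(x, theta + h)) - 2 * np.log(px) + np.log(p(x, theta - h))) / h**2
dsqrt = (np.sqrt(p(x, theta + h)) - np.sqrt(p(x, theta - h))) / (2 * h)

I_score = trapezoid(dlog**2 * px, x)    # formula (14.2.7)
I_hess = -trapezoid(d2log * px, x)      # formula (14.2.8)
I_sqrt = 4.0 * trapezoid(dsqrt**2, x)   # formula (14.2.9)

print(I_score, I_hess, I_sqrt, 1.0 / theta**2)
```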


Relation (14.2.8) relates the Fisher matrix to the negative Hessian of
the log-likelihood function and has the following geometric interpretation. If
$\ell(x;\theta) = \ln p(x;\theta)$ is the log-likelihood function corresponding to the
observation $x$ and parameter $\theta$, the maximum likelihood estimator of $\theta$ is

\[
\hat\theta^{MLE} = \arg\max_{\theta}\, \ell(x;\theta).
\]

Since there is a maximum at $\theta = \hat\theta^{MLE}$, then $\partial_\theta \ell(x;\hat\theta^{MLE}) = 0$. Expanding
about $\hat\theta^{MLE}$, we have

\[
\ell(x;\theta) = \ell(x;\hat\theta^{MLE}) + \frac{1}{2}\sum_{i,j} \frac{\partial^2 \ell}{\partial\theta_i\,\partial\theta_j}(x;\hat\theta^{MLE})\,(\hat\theta^{MLE}_i - \theta_i)(\hat\theta^{MLE}_j - \theta_j) + o(\|\hat\theta^{MLE} - \theta\|^2).
\]

The Fisher matrix, given by the expectation of the negative coefficient of the
sum in the right side as in (14.2.8), measures how curved the crest of $\ell(x;\theta)$ is
at $\hat\theta^{MLE}$.

Under the regularity assumption (14.1.4), relation
(14.2.9) can be used to show that $g_{ij}(\theta)$ is a symmetric, positive definite and
nondegenerate matrix (see, for instance, Proposition 1.6.2 in [22]).

Hence, the Fisher information matrix provides the coefficients of a
Riemannian metric (see section 13.1.3) on the statistical manifold $\mathcal{S} = \{\theta \to p(x;\theta)\}$,
called the Fisher metric. This allows for computing lengths of vectors, angles,
distances and areas on statistical manifolds. Sometimes, the Fisher metric
$g_{ij}(\theta)$ is considered on the parameter space, $\Theta$.
















A natural question is what distinguishes the Fisher metric among
all Riemannian metrics that can be defined on a statistical manifold. It can
be shown that the Fisher metric has the following two properties, see [4] and
[22]:

1. $g_{ij}$ is invariant under reparametrizations of the sample space. This means
the statistical manifolds $\mathcal{S} = \{p(x;\theta);\ \theta \in \Theta\}$ and $\tilde{\mathcal{S}} = \{\tilde p(h(x);\theta);\ \theta \in \Theta\}$,
with $h$ an invertible and differentiable function, have equal metrics, $g_{ij}(\theta) = \tilde g_{ij}(\theta)$.
This invariance property can be found in Theorem 1.6.4 of [22].

2. $g_{ij}$ is covariant under reparametrizations. This means that if we consider
a different parametrization $\xi = \xi(\theta_1, \dots, \theta_n)$ depending on $\theta$, the Fisher
matrices in both parametrizations are related by

\[
g_{ij}(\theta) = \sum_{k,r} \tilde g_{kr}(\xi)\Big|_{\xi = \xi(\theta)} \frac{\partial \xi^k}{\partial\theta^i}\frac{\partial \xi^r}{\partial\theta^j}.
\]

For a proof of this fact, see Theorem 1.6.5 of [22].

What makes the Fisher metric so special is that it is the only metric
satisfying the previous two invariance conditions. The proof of this distinguished
result can be found in [28].

Cramér-Rao inequality. There is a multivariate version of the inequality
(14.2.5). The vector parameter is $\theta = (\theta_1, \dots, \theta_N)^T \in \mathbb{R}^N$ and represents a
coordinate system for the parameter space $(\Theta, g)$. Consider an estimator

\[
\hat\theta(X) = (\hat\theta_1(X), \dots, \hat\theta_N(X))^T
\]

which is unbiased, $\mathbb{E}[\hat\theta(X)] = \theta$. Then

\[
\mathrm{Cov}(\hat\theta(X)) \ge g(\theta)^{-1}. \tag{14.2.10}
\]

This means that the difference matrix $A_{ij} = \mathrm{Cov}(\hat\theta_i(X), \hat\theta_j(X)) - g^{ij}(\theta)$ is
positive semidefinite for all $\theta \in \Theta$, i.e., it has nonnegative eigenvalues. This
inequality will be useful later when discussing efficient estimators.

For further applications of Fisher information in science the reader is
referred to [40], [41], and [39]. For applications to other types of neurons, see
[121].


14.3 Neuromanifold of a Neural Net 

A neuromanifold is a Riemannian manifold associated with a neural network
as follows. Let $y = f_\theta(x)$ be the input-output mapping of a given
neural network with input and output densities $p_X(x)$ and $p_Y(y;\theta)$, and joint
density $p(x,y;\theta)$, where $\theta$ is a vector parameter consisting of all the weights
and biases of the network. The statistical manifold associated with the neural
network is $\mathcal{S} = \{p(x,y;\theta);\ \theta \in \Theta\}$. The Fisher metric can be defined on $\mathcal{S}$
by formula (14.2.7), where we consider the log-likelihood function given by
$\ell(x,y;\theta) = \ln p(x,y;\theta)$. This can also be expressed as the following double
integral:

\[
g_{ij}(\theta) = \iint \frac{\partial \ell(x,y;\theta)}{\partial\theta_i}\frac{\partial \ell(x,y;\theta)}{\partial\theta_j}\, p(x,y;\theta)\, dx\, dy. \tag{14.3.11}
\]

Since $g = (g_{ij})$ is a Riemannian metric on $\mathcal{S}$, $(\mathcal{S}, g)$ becomes a Riemannian
manifold.


Definition 14.3.1 The neuromanifold associated with the aforementioned
neural network is the Riemannian manifold $(\mathcal{S}, g)$, where $\mathcal{S} = \{p(x,y;\theta);\ \theta \in \Theta\}$
is the statistical manifold of the joint input-output densities of the neural
network, $\theta$ are the network weights and biases, and $g$ is the Fisher metric.

Each joint probability density associated with a neural net, $p(x,y;\theta)$, can
be regarded as a point on this manifold. The learning process, which is an
adjustment of parameters, can be visualized as a curve on the neuromanifold.

It is worth noting that the metric $g_{ij}(\theta)$ is independent of the target
values $z$, i.e., it is an intrinsic object. All concepts derived from the Fisher
information will form the intrinsic geometry of the neuromanifold.

The next computations will be performed under the assumption that the
noise in formula (14.1.1) is standard normal noise, $n \sim \mathcal{N}(0,1)$. Sometimes,
one may consider a scaled noise, $n \sim \mathcal{N}(0, s^2)$, and treat the standard
deviation, $s$, as a hyperparameter. The use of (14.1.3) yields the following
log-likelihood function

\[
\begin{aligned}
\ell(x,y;\theta) &= \ln p(x) + \ln p(y|x;\theta) \\
&= \ln p(x) - \ln\sqrt{2\pi} - \frac{1}{2}\big(y - f_\theta(x)\big)^2,
\end{aligned}
\]

with the partial derivative

\[
\frac{\partial \ell(x,y;\theta)}{\partial\theta_k} = \big(y - f_\theta(x)\big)\frac{\partial f_\theta(x)}{\partial\theta_k}. \tag{14.3.12}
\]

The sensitivity of the input-output mapping with respect to a parameter,
$\dfrac{\partial f_\theta}{\partial\theta_k}$, is specific to each type of neural network and is a measure of the
complexity of each network. We shall compute it in a few particular cases and then
provide a general recursive formula in the case of feedforward neural nets.










14.4 Fisher Metric for One Neuron 


In this section we shall provide explicit formulas for the Fisher metric in
the case of a single neuron. We consider a neuron with the input given by
the $n$-dimensional random vector $X = (X_1, \dots, X_n)^T$, input-output mapping
$f_\theta(x) = \phi(w^T x + b)$, with parameters $\theta = (w, b)$, and differentiable activation
function $\phi$. An application of the chain rule produces the partial derivatives

\[
\frac{\partial f_\theta(x)}{\partial w_k} = x_k\,\phi'(w^T x + b), \qquad
\frac{\partial f_\theta(x)}{\partial b} = \phi'(w^T x + b).
\]

Then formula (14.3.12) provides

\[
\begin{aligned}
\frac{\partial \ell(x,y;\theta)}{\partial w_k} &= x_k\big(y - \phi(w^T x + b)\big)\phi'(w^T x + b), \qquad 1 \le k \le n, \\
\frac{\partial \ell(x,y;\theta)}{\partial b} &= \big(y - \phi(w^T x + b)\big)\phi'(w^T x + b).
\end{aligned}
\]


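Since $\theta = (w_1, \dots, w_n, b)$, the Fisher matrix is $(n+1)$-dimensional. Using
(14.3.11) and changing the order of integration, we have

\[
\begin{aligned}
g_{00}(w,b) &= \iint \Big(\frac{\partial \ell(x,y;\theta)}{\partial b}\Big)^2 p(x,y)\, dx\, dy \\
&= \iint \big(y - \phi(w^T x + b)\big)^2 \phi'(w^T x + b)^2\, p(x)\, p(y|x;\theta)\, dx\, dy \\
&= \int \phi'(w^T x + b)^2\, p(x) \Big(\int \big(y - \phi(w^T x + b)\big)^2 p(y|x;\theta)\, dy\Big) dx.
\end{aligned}
\]

Substituting $p(y|x;\theta)$ from (14.1.2) and changing the variable $u = y - \phi(w^T x + b)$
yields

\[
\begin{aligned}
g_{00}(w,b) &= \int \phi'(w^T x + b)^2\, p(x) \Big(\frac{1}{\sqrt{2\pi}}\int u^2 e^{-\frac{1}{2}u^2}\, du\Big) dx \\
&= \int \phi'(w^T x + b)^2\, p(x)\, dx \\
&= \mathbb{E}^{P_X}\big[\phi'(w^T X + b)^2\big],
\end{aligned}
\]

where $\mathbb{E}^{P_X}$ denotes the expectation operator taken under the law of the input $X$.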












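Similarly,

\[
\begin{aligned}
g_{0k}(w,b) &= \iint \frac{\partial \ell(x,y;\theta)}{\partial b}\frac{\partial \ell(x,y;\theta)}{\partial w_k}\, p(x,y)\, dx\, dy \\
&= \iint \big(y - \phi(w^T x + b)\big)^2 \phi'(w^T x + b)^2\, x_k\, p(x)\, p(y|x;\theta)\, dx\, dy \\
&= \int \phi'(w^T x + b)^2\, x_k\, p(x) \underbrace{\Big(\int \big(y - \phi(w^T x + b)\big)^2 p(y|x;\theta)\, dy\Big)}_{=1} dx \\
&= \mathbb{E}^{P_X}\big[X_k\,\phi'(w^T X + b)^2\big].
\end{aligned}
\]

Also,

\[
\begin{aligned}
g_{jk}(w,b) &= \iint \frac{\partial \ell(x,y;\theta)}{\partial w_j}\frac{\partial \ell(x,y;\theta)}{\partial w_k}\, p(x,y)\, dx\, dy \\
&= \iint \big(y - \phi(w^T x + b)\big)^2 \phi'(w^T x + b)^2\, x_j x_k\, p(x)\, p(y|x;\theta)\, dx\, dy \\
&= \int \phi'(w^T x + b)^2\, x_j x_k\, p(x) \underbrace{\Big(\int \big(y - \phi(w^T x + b)\big)^2 p(y|x;\theta)\, dy\Big)}_{=1} dx \\
&= \mathbb{E}^{P_X}\big[X_j X_k\,\phi'(w^T X + b)^2\big].
\end{aligned}
\]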


To conclude, the last three formulas can be written as a single formula:

\[
g_{ij}(\bar w) = \mathbb{E}^{P_X}\big[\bar X_i \bar X_j\,\phi'(\bar w^T \bar X)^2\big], \qquad 0 \le i, j \le n, \tag{14.4.13}
\]

where $\bar w = (b, w^T)^T$ and $\bar X = (X_0, X^T)^T$, with $X_0 = 1$, so that $\bar w^T \bar X = w^T X + b$.
In general, formulas (14.4.13) cannot be simplified any further. However, if the $X_i$ are independent,
standard normally distributed, and $b = 0$, then a closed-form expression exists
for the Fisher matrix and for its inverse, see [6]. We shall develop this idea
after we investigate a few particular types of neurons.
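
Formula (14.4.13) is an expectation over the input law, so it can be approximated by sampling. The following sketch is a minimal Monte Carlo illustration, assuming standard normal inputs $X \sim \mathcal{N}(0, I_n)$ (the formula itself holds for any input density); the function name `neuron_fisher_mc` and the sample values of `w` and `b` are hypothetical.

```python
import numpy as np

def neuron_fisher_mc(w, b, dphi, num_samples=200_000, seed=0):
    """Estimate g_ij = E[Xbar_i Xbar_j phi'(w^T X + b)^2], 0 <= i, j <= n.
    Index 0 corresponds to the bias b (Xbar_0 = 1).
    Assumption made here: X ~ N(0, I_n)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((num_samples, len(w)))
    Xbar = np.hstack([np.ones((num_samples, 1)), X])   # prepend X_0 = 1
    weight = dphi(X @ w + b) ** 2                       # phi'(w^T X + b)^2
    return (Xbar * weight[:, None]).T @ Xbar / num_samples

w = np.array([0.8, -0.5])
b = 0.3
g_linear = neuron_fisher_mc(w, b, lambda s: np.ones_like(s))
g_tanh = neuron_fisher_mc(w, b, lambda s: 1.0 - np.tanh(s) ** 2)

print(np.round(g_linear, 3))
print(np.round(g_tanh, 3))
```

For the linear activation the estimate is close to the second-moment matrix of $(1, X)$, which under the sampling assumption above is the identity.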

Linear neuron. In this case the activation function is $\phi(x) = x$, so replacing
the derivative $\phi'(x)$ by 1 in the previous formulas yields

\[
g_{00} = 1, \qquad g_{0k}(w,b) = \mathbb{E}^{P_X}[X_k], \qquad g_{jk}(w,b) = \mathbb{E}^{P_X}[X_j X_k].
\]

These formulas suggest that the Fisher matrix of a linear neuron describes the
auto-covariance of the input vector $\bar X = (1, X)$. Furthermore, since $g_{jk}(w,b)$
do not depend on $\theta = (w,b)$, then $\partial_\theta g_{jk} = 0$, which implies vanishing
Christoffel symbols, $\Gamma^{i}_{jk} = 0$. Therefore, the neuromanifold of a linear neuron is
intrinsically flat and all its intrinsic geometry is induced only by the input
covariances.



















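ReLU neuron. Consider the activation function $\phi(x) = ReLU(x)$, which
is piecewise differentiable. We shall differentiate it in the generalized sense,
as in Appendix, section F.2. Then $\phi'(x) = ReLU'(x) = H(x)$, see Exercise
8.8.1. Since $H^2(x) = H(x)$, we obtain

\[
\begin{aligned}
g_{00} &= \mathbb{E}^{P_X}[ReLU'(w^T X + b)^2] = \mathbb{E}^{P_X}[H(w^T X + b)^2] = \mathbb{E}^{P_X}[H(w^T X + b)] \\
&= \int_{\{w^T x + b > 0\}} p(x)\, dx = P(w^T X + b > 0) = P(X \in R_{w,b}).
\end{aligned}
\]

Therefore, the coefficient $g_{00}$ represents the probability that the input vector
$X$ belongs to the half-space $R_{w,b} = \{w^T x + b > 0\}$. We also note that
$0 \le g_{00} \le 1$. Its value depends on the hyperplane translation
parameter $b$: since the half-space $R_{w,b}$ shrinks to the empty set as $b \to -\infty$ and grows to the whole space as $b \to \infty$, we have

\[
\lim_{b \to -\infty} g_{00} = 0, \qquad \lim_{b \to \infty} g_{00} = 1.
\]

The other metric coefficients are given by

\[
\begin{aligned}
g_{0k} &= \mathbb{E}^{P_X}[X_k\, ReLU'(w^T X + b)^2] = \mathbb{E}^{P_X}[X_k\, H(w^T X + b)] = \int_{\{w^T x + b > 0\}} x_k\, p(x)\, dx, \\
g_{jk} &= \mathbb{E}^{P_X}[X_j X_k\, ReLU'(w^T X + b)^2] = \mathbb{E}^{P_X}[X_j X_k\, H(w^T X + b)] = \int_{\{w^T x + b > 0\}} x_j x_k\, p(x)\, dx.
\end{aligned}
\]

Using $0 \le H(w^T x + b) \le 1$, it follows that $g_{0k} \le \mathbb{E}^{P_X}[X_k]$, $g_{jk} \le \mathbb{E}^{P_X}[X_j X_k]$.
In fact, by the same argument as for $g_{00}$, we have

\[
\lim_{b \to -\infty} g_{0k} = \lim_{b \to -\infty} g_{jk} = 0, \qquad \lim_{b \to \infty} g_{0k} = \mathbb{E}[X_k], \qquad \lim_{b \to \infty} g_{jk} = \mathbb{E}[X_j X_k].
\]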
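
The interpretation of $g_{00}$ as a half-space probability can be checked numerically. The sketch below assumes standard normal inputs, $X \sim \mathcal{N}(0, I_n)$, in which case $w^T X + b \sim \mathcal{N}(b, |w|^2)$ and the probability equals $\Phi(b/|w|)$, with $\Phi$ the standard normal distribution function. This input law and the function names are assumptions made for the illustration only.

```python
import math
import numpy as np

def relu_g00_exact(w, b):
    """P(w^T X + b > 0) for X ~ N(0, I_n) (assumed input law): Phi(b / |w|)."""
    z = b / np.linalg.norm(w)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def relu_g00_mc(w, b, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E[H(w^T X + b)] under the same input law."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((num_samples, len(w)))
    return np.mean((X @ w + b) > 0)

w = np.array([1.0, -2.0, 0.5])
for b in (-50.0, -1.0, 0.0, 1.0, 50.0):
    print(b, relu_g00_exact(w, b), relu_g00_mc(w, b))
```

The printed values increase from 0 to 1 as $b$ runs from large negative to large positive values.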


Hyperbolic tangent neuron. The activation function $\phi(x) = \tanh x$
satisfies $\phi'(x) = 1 - \tanh^2 x$. As usual, we denote for simplicity $t(x) = \tanh x$.
The Fisher metric coefficients are given by

\[
\begin{aligned}
g_{00} &= \mathbb{E}^{P_X}[t'(w^T X + b)^2] = \mathbb{E}^{P_X}\big[(1 - t^2(w^T X + b))^2\big], \\
g_{0k} &= \mathbb{E}^{P_X}[X_k\, t'(w^T X + b)^2] = \mathbb{E}^{P_X}\big[X_k (1 - t^2(w^T X + b))^2\big], \\
g_{jk} &= \mathbb{E}^{P_X}[X_j X_k\, t'(w^T X + b)^2] = \mathbb{E}^{P_X}\big[X_j X_k (1 - t^2(w^T X + b))^2\big],
\end{aligned}
\]

for $1 \le j, k \le n$. It is worth noting the inequalities

\[
0 < g_{00} \le 1, \qquad g_{0k} \le \mathbb{E}^{P_X}[X_k], \qquad g_{jk} \le \mathbb{E}^{P_X}[X_j X_k].
\]

















14.5 The Fisher Matrix and Its Inverse 


In order to compute the Fisher matrix (14.4.13) and its inverse some
additional conditions have to be assumed. We shall consider $(X_1, \dots, X_n) \sim \mathcal{N}(0, I_n)$,
i.e., the input is a multivariate standard normal random variable.

Denote $|w|^2 = w^T w = \sum_i w_i^2$ and consider the functions

\[
C_1(w,b) = \frac{1}{|w|^2\sqrt{2\pi}}\int \phi'(|w|\epsilon + b)^2\, e^{-\frac{1}{2}\epsilon^2}\, d\epsilon \tag{14.5.14}
\]

\[
C_2(w,b) = \frac{1}{|w|^2\sqrt{2\pi}}\int \phi'(|w|\epsilon + b)^2\, \epsilon^2 e^{-\frac{1}{2}\epsilon^2}\, d\epsilon \tag{14.5.15}
\]

\[
C_3(w,b) = \frac{1}{|w|^2\sqrt{2\pi}}\int \phi'(|w|\epsilon + b)^2\, \epsilon\, e^{-\frac{1}{2}\epsilon^2}\, d\epsilon. \tag{14.5.16}
\]


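Using that $w^T X = \sum_i w_i X_i \sim \mathcal{N}(0, |w|^2)$, we can write $w^T X = |w|\epsilon$, with
$\epsilon \sim \mathcal{N}(0,1)$. Then we have

\[
\begin{aligned}
g_{00}(w,b) &= \mathbb{E}[\phi'(w^T X + b)^2] = \mathbb{E}[\phi'(|w|\epsilon + b)^2] \\
&= \frac{1}{\sqrt{2\pi}}\int \phi'(|w|\epsilon + b)^2\, e^{-\frac{1}{2}\epsilon^2}\, d\epsilon = |w|^2 C_1(w,b).
\end{aligned} \tag{14.5.17}
\]

We compute next $g_{0k} = \mathbb{E}[X_k\,\phi'(w^T X + b)^2]$, $1 \le k \le n$. Writing again
$w^T X = |w|\epsilon$, we have

\[
\begin{aligned}
\sum_k g_{0k} w_k &= \mathbb{E}[w^T X\,\phi'(w^T X + b)^2] = \mathbb{E}[|w|\epsilon\,\phi'(|w|\epsilon + b)^2] \\
&= \frac{|w|}{\sqrt{2\pi}}\int \epsilon\,\phi'(|w|\epsilon + b)^2\, e^{-\frac{1}{2}\epsilon^2}\, d\epsilon = |w|^3 C_3(w,b).
\end{aligned} \tag{14.5.18}
\]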


Let $v$ be an arbitrary unit vector, orthogonal to $w$. Then

\[
\begin{aligned}
\sum_k g_{0k} v_k &= \mathbb{E}[v^T X\,\phi'(w^T X + b)^2] = \mathbb{E}[v^T X]\,\mathbb{E}[\phi'(w^T X + b)^2] \\
&= \mathbb{E}[\epsilon]\,\mathbb{E}[\phi'(w^T X + b)^2] = 0,
\end{aligned}
\]

where we used $v^T X = |v|\epsilon = \epsilon \sim \mathcal{N}(0,1)$, the fact that $\mathbb{E}[\epsilon] = 0$ and that
$w^T X$ and $v^T X$ are independent, see Exercise 14.13.3. It follows that the
vector $(g_{01}, \dots, g_{0n})^T$ is normal to all vectors $v$ (which are perpendicular to
$w$). Hence, $(g_{01}, \dots, g_{0n})^T$ has to be proportional to $w$, i.e., there is $\lambda \in \mathbb{R}$
such that

\[
g_{0k} = \lambda w_k, \qquad 1 \le k \le n. \tag{14.5.19}
\]




























To determine $\lambda$, multiply by $w_k$ and take the sum

\[
\sum_k g_{0k} w_k = \lambda \sum_k w_k^2 = \lambda |w|^2,
\]

from where

\[
\lambda = \frac{1}{|w|^2}\sum_k g_{0k} w_k = \frac{|w|^3 C_3(w,b)}{|w|^2} = |w|\, C_3(w,b),
\]

where we used (14.5.18). Then (14.5.19) yields

\[
g_{0k} = w_k |w|\, C_3(w,b), \qquad 1 \le k \le n.
\]

Remark 14.5.1 If $b = 0$ and $\phi(x) = \tanh(x)$,
then $C_3(w,b) = 0$, because $\phi'(|w|\epsilon)^2$ is an even function of $\epsilon$, so the integrand in (14.5.16) is odd. Consequently,
$g_{0k} = 0$, $1 \le k \le n$.


We shall show in the following that the matrix

\[
\tilde g_{jk} = \mathbb{E}[X_j X_k\,\phi'(w^T X + b)^2], \qquad 1 \le j, k \le n,
\]

has the explicit form

\[
\tilde g_{jk} = |w|^2 C_1(w,b)\,\delta_{jk} + \big(C_2(w,b) - C_1(w,b)\big) w_j w_k,
\]

where $\delta_{jk}$ is 1 for $j = k$ and 0 otherwise. In equivalent matrix form, the matrix

\[
\tilde g = \mathbb{E}[X X^T \phi'(w^T X + b)^2]
\]

can be written as the following sum:

\[
\tilde g = |w|^2 C_1(w,b)\, I_n + \big(C_2(w,b) - C_1(w,b)\big) w w^T. \tag{14.5.20}
\]

For simplicity, we shall denote the matrix on the right side of (14.5.20) by $h$. Since
both $\tilde g$ and $h$ are symmetric matrices, in order to show that $\tilde g = h$, we shall
use Exercise 14.13.1. Therefore, it suffices to prove that $w^T \tilde g w = w^T h w$ and
$v^T \tilde g v = v^T h v$, for all unit vectors $v$ normal to $w$. We shall do this in two steps:

Step 1: Show $w^T \tilde g w = w^T h w$. The left side can be computed as

\[
\begin{aligned}
w^T \tilde g w &= w^T \mathbb{E}[X X^T \phi'(w^T X + b)^2]\, w = \mathbb{E}[(w^T X)^2 \phi'(w^T X + b)^2] \\
&= \mathbb{E}[|w|^2 \epsilon^2 \phi'(|w|\epsilon + b)^2] = \frac{|w|^2}{\sqrt{2\pi}}\int \epsilon^2 \phi'(|w|\epsilon + b)^2\, e^{-\frac{1}{2}\epsilon^2}\, d\epsilon \\
&= |w|^4 C_2(w,b).
\end{aligned} \tag{14.5.21}
\]

The right side is

\[
\begin{aligned}
w^T h w &= |w|^2 C_1(w,b)\, w^T w + \big(C_2(w,b) - C_1(w,b)\big) w^T w\, w^T w \\
&= |w|^4 C_1(w,b) + \big(C_2(w,b) - C_1(w,b)\big)|w|^4 \\
&= |w|^4 C_2(w,b).
\end{aligned} \tag{14.5.22}
\]

















Since (14.5.21) and (14.5.22) agree, we have obtained the desired identity.

Step 2: Show $v^T \tilde g v = v^T h v$. Using relation (14.5.17) and the fact that $w^T X$
and $v^T X$ are independent, see Exercise 14.13.2, the left side becomes

\[
\begin{aligned}
v^T \tilde g v &= v^T \mathbb{E}[X X^T \phi'(w^T X + b)^2]\, v = \mathbb{E}[(v^T X)^2 \phi'(w^T X + b)^2] \\
&= \mathbb{E}[(v^T X)^2]\,\mathbb{E}[\phi'(w^T X + b)^2] = \underbrace{\mathbb{E}[\epsilon^2]}_{=1}\,\underbrace{\mathbb{E}[\phi'(|w|\epsilon + b)^2]}_{= g_{00}} \\
&= |w|^2 C_1(w,b).
\end{aligned} \tag{14.5.23}
\]

For the right side we have the straightforward computation

\[
\begin{aligned}
v^T h v &= v^T \Big(|w|^2 C_1(w,b)\, I_n + \big(C_2(w,b) - C_1(w,b)\big) w w^T\Big) v \\
&= |w|^2 C_1(w,b)\, v^T v + \big(C_2(w,b) - C_1(w,b)\big)(v^T w)^2 \\
&= |w|^2 C_1(w,b),
\end{aligned} \tag{14.5.24}
\]

where we used the orthonormality conditions $v^T v = 1$ and $v^T w = 0$. Since
(14.5.23) and (14.5.24) agree, we proved the desired identity.

To conclude, the Fisher matrix is given by the following $(n+1) \times (n+1)$
matrix $g = (g_{ij})_{0 \le i,j \le n}$:

\[
\begin{aligned}
g_{00} &= |w|^2 C_1(w,b), \\
g_{0k} = g_{k0} &= w_k |w|\, C_3(w,b), \qquad 1 \le k \le n, \\
g_{jk} &= |w|^2 C_1(w,b)\,\delta_{jk} + \big(C_2(w,b) - C_1(w,b)\big) w_j w_k, \qquad 1 \le j, k \le n.
\end{aligned}
\]
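
The closed form above reduces the Fisher matrix to three one-dimensional Gaussian integrals, which can be evaluated by quadrature. The sketch below is one possible way to do this, using Gauss-Hermite quadrature for the weight $e^{-\epsilon^2/2}$; the function names, the quadrature degree, and the sample parameters are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def c_functions(w, b, dphi, deg=60):
    """C1, C2, C3 of (14.5.14)-(14.5.16), by Gauss-Hermite quadrature
    with weight function exp(-e^2 / 2)."""
    nodes, weights = hermegauss(deg)
    norm_w = np.linalg.norm(w)
    base = dphi(norm_w * nodes + b) ** 2 * weights / (norm_w**2 * np.sqrt(2 * np.pi))
    return base.sum(), (base * nodes**2).sum(), (base * nodes).sum()

def fisher_closed_form(w, b, dphi):
    """Assemble the (n+1) x (n+1) Fisher matrix; index 0 is the bias."""
    c1, c2, c3 = c_functions(w, b, dphi)
    n, r2 = len(w), w @ w
    g = np.empty((n + 1, n + 1))
    g[0, 0] = r2 * c1
    g[0, 1:] = g[1:, 0] = w * np.sqrt(r2) * c3
    g[1:, 1:] = r2 * c1 * np.eye(n) + (c2 - c1) * np.outer(w, w)
    return g

w = np.array([0.8, -0.5, 0.3])
dtanh = lambda s: 1.0 - np.tanh(s) ** 2
print(np.round(fisher_closed_form(w, 0.4, dtanh), 4))
print(np.round(fisher_closed_form(w, 0.0, dtanh), 4))   # border entries vanish for b = 0
```

For a linear activation the three constants become $C_1 = C_2 = 1/|w|^2$ and $C_3 = 0$, so the matrix reduces to the identity, in agreement with the linear neuron case under standard normal inputs.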


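Finding the inverse of $g$. It is easier if we write the Fisher matrix as

\[
g = \begin{pmatrix}
g_{00} & g_{01} & \cdots & g_{0n} \\
g_{10} & & & \\
\vdots & & \tilde g & \\
g_{n0} & & &
\end{pmatrix}.
\]

The $(n \times n)$-block $\tilde g = (g_{ij})$, $1 \le i, j \le n$, is invertible in closed form. We shall
look for an inverse of a form similar to (14.5.20)

\[
\tilde g^{-1} = \rho_1 I_n + \rho_2\, w w^T,
\]

and determine the functions $\rho_1$ and $\rho_2$ in terms of $w$ and $b$, such that $\tilde g \tilde g^{-1} = I_n$.
Using $w w^T w w^T = w |w|^2 w^T = |w|^2 w w^T$, an algebraic computation provides

\[
\begin{aligned}
\tilde g \tilde g^{-1} &= \big(|w|^2 C_1 I_n + (C_2 - C_1) w w^T\big)\big(\rho_1 I_n + \rho_2 w w^T\big) \\
&= |w|^2 C_1 \rho_1 I_n + \big[\rho_1 (C_2 - C_1) + |w|^2 \rho_2 C_2\big] w w^T.
\end{aligned}
\]

By coefficient identification, we ask

\[
\begin{aligned}
|w|^2 C_1 \rho_1 &= 1 \\
\rho_1 (C_2 - C_1) + |w|^2 \rho_2 C_2 &= 0,
\end{aligned}
\]

with solutions

\[
\rho_1 = \frac{1}{|w|^2 C_1}, \qquad \rho_2 = \frac{1}{|w|^4}\Big(\frac{1}{C_2} - \frac{1}{C_1}\Big).
\]

Therefore, the inverse of $\tilde g$ is given by

\[
\tilde g^{-1} = \frac{1}{|w|^2 C_1}\, I_n + \frac{1}{|w|^4}\Big(\frac{1}{C_2} - \frac{1}{C_1}\Big) w w^T. \tag{14.5.25}
\]

This result appears as Theorem 4 in [6].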
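
The inverse formula (14.5.25) is a purely algebraic statement and can be verified for arbitrary positive values of $C_1$ and $C_2$. The following sketch does this numerically; the function name `block_and_inverse` and the sample values are hypothetical.

```python
import numpy as np

def block_and_inverse(w, c1, c2):
    """Return the n x n block |w|^2 C1 I + (C2 - C1) w w^T together with
    the candidate inverse rho1 I + rho2 w w^T from (14.5.25)."""
    n, r2 = len(w), w @ w
    ww = np.outer(w, w)
    g_block = r2 * c1 * np.eye(n) + (c2 - c1) * ww
    rho1 = 1.0 / (r2 * c1)
    rho2 = (1.0 / c2 - 1.0 / c1) / r2**2
    return g_block, rho1 * np.eye(n) + rho2 * ww

w = np.array([1.0, 2.0, -1.0])
g_block, g_block_inv = block_and_inverse(w, c1=0.3, c2=0.7)
print(np.round(g_block @ g_block_inv, 12))   # close to the identity matrix
```

The formula only requires a rank-one update of a multiple of the identity, so no general matrix inversion is needed for the block.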

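When inverting the matrix $g$ we consider two cases, where $\tilde g$ denotes the $(n \times n)$-block $(g_{ij})$, $1 \le i, j \le n$:

Case 1: $b = 0$ and $\phi(x) = \tanh(x)$. In this case

\[
C_3(w, 0) = \frac{1}{|w|^2\sqrt{2\pi}}\int \phi'(|w|\epsilon)^2\, \epsilon\, e^{-\frac{1}{2}\epsilon^2}\, d\epsilon = 0
\]

since $\phi'(|w|\epsilon)^2$ is an even function. It follows that $g_{0k} = g_{k0} = 0$, for $1 \le k \le n$,
and the matrix can be inverted block by block as

\[
g^{-1} = \begin{pmatrix} \dfrac{1}{g_{00}} & 0 \\ 0 & \tilde g^{-1} \end{pmatrix},
\]

with $\tilde g^{-1}$ given by (14.5.25).

Case 2: The general case. We shall indicate how to compute $g^{-1}$ in an iterative
way. First, we decompose $g$ as a sum of two matrices

\[
g = \begin{pmatrix}
0 & g_{01} & \cdots & g_{0n} \\
g_{10} & & & \\
\vdots & & O_n & \\
g_{n0} & & &
\end{pmatrix}
+
\begin{pmatrix}
g_{00} & 0 & \cdots & 0 \\
0 & & & \\
\vdots & & \tilde g & \\
0 & & &
\end{pmatrix}
= A_1 + A_2.
\]

The matrix $A_2$ is invertible, with the known inverse

\[
A_2^{-1} = \begin{pmatrix} \dfrac{1}{g_{00}} & 0 \\ 0 & \tilde g^{-1} \end{pmatrix}.
\]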



































The inversion of a sum of two matrices is covered in Appendix, section G.
Using the expansion method, we have

\[
g^{-1} = (A_1 + A_2)^{-1} = A_2^{-1}\sum_{k \ge 0} (-1)^k (A_1 A_2^{-1})^k. \tag{14.5.26}
\]

Since the product

\[
A_1 A_2^{-1} = \begin{pmatrix}
0 & \sum_{j=1}^{n} g_{0j}\, g^{j1} & \cdots & \sum_{j=1}^{n} g_{0j}\, g^{jn} \\
g_{10} & & & \\
\vdots & & O_n & \\
g_{n0} & & &
\end{pmatrix}
\]

is a sparse matrix, its powers are not costly to compute.

Another iterative method to find the inverse is to construct the sequence
$(g_n^{-1})_{n \ge 0}$ defined recursively by $g_0^{-1} = O$ (the zero matrix), $g_{n+1}^{-1} = f(g_n^{-1})$, where $f(M) = A_2^{-1} - M A_1 A_2^{-1}$
is a contraction. The sequence $g_n^{-1}$ tends to the fixed point
of the mapping $f$, which is the inverse $g^{-1}$.
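
Both iterative schemes can be sketched in a few lines. The code below is an illustration under stated assumptions: the test matrix `g` is a hypothetical symmetric positive definite matrix chosen so that the iteration converges quickly, `np.linalg.inv` is applied to $A_2$ only for brevity (in the text $A_2^{-1}$ is available in closed form), and the number of terms is fixed rather than chosen adaptively.

```python
import numpy as np

def split(g):
    """Return A1 (border entries only) and A2 (block diagonal), g = A1 + A2."""
    a2 = np.zeros_like(g)
    a2[0, 0] = g[0, 0]
    a2[1:, 1:] = g[1:, 1:]
    return g - a2, a2

def inverse_by_series(g, num_terms=200):
    """Truncated expansion (14.5.26): A2^{-1} sum_k (-1)^k (A1 A2^{-1})^k."""
    a1, a2 = split(g)
    a2_inv = np.linalg.inv(a2)
    m = a1 @ a2_inv
    term, total = np.eye(len(g)), np.zeros_like(g)
    for _ in range(num_terms):
        total += term
        term = -term @ m            # next term (-1)^k (A1 A2^{-1})^k
    return a2_inv @ total

def inverse_by_fixed_point(g, num_iters=200):
    """Iterate M <- f(M) = A2^{-1} - M A1 A2^{-1}, starting from zero."""
    a1, a2 = split(g)
    a2_inv = np.linalg.inv(a2)
    m = a1 @ a2_inv
    x = np.zeros_like(g)
    for _ in range(num_iters):
        x = a2_inv - x @ m
    return x

# Hypothetical symmetric positive definite example
g = np.array([[2.0, 0.5, -0.3, 0.2],
              [0.5, 1.6, 0.1, 0.1],
              [-0.3, 0.1, 1.6, 0.1],
              [0.2, 0.1, 0.1, 1.6]])
print(np.max(np.abs(inverse_by_series(g) - np.linalg.inv(g))))
print(np.max(np.abs(inverse_by_fixed_point(g) - np.linalg.inv(g))))
```

A fixed point $M$ of $f$ satisfies $M(I + A_1 A_2^{-1}) = A_2^{-1}$, that is $M(A_1 + A_2) = I$, which is why the iteration targets $g^{-1}$.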


Convergence conditions. Series (14.5.26) and the sequence $g_n^{-1}$ converge
provided some conditions are satisfied. Following section G of Appendix,
the required condition is $\|A_1 A_2^{-1}\| < 1$, where the norm is the value of the
largest eigenvalue. We shall show that $\|A_1 A_2^{-1}\|^2 = g_{00}$ and that the condition
$\|A_1 A_2^{-1}\| < 1$ is satisfied by some familiar classes of neurons.


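Lemma 14.5.2 Let $a, b \in \mathbb{R}^n$ be two vectors such that $a^T b > 0$. Then the
eigenvalues of the matrix

\[
M = \begin{pmatrix}
0 & b_1 & \cdots & b_n \\
a_1 & & & \\
\vdots & & O_n & \\
a_n & & &
\end{pmatrix}
\]

are $\lambda_1 = (a^T b)^{1/2}$, $\lambda_2 = -(a^T b)^{1/2}$, $\lambda_j = 0$, for all $j \ge 3$.

Proof: See Exercise 14.13.4. ■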

We let $M = A_1 A_2^{-1}$ and show that $a^T b < 1$. We have

\[
\begin{aligned}
a^T b &= g_{10}\sum_{j_1} g_{0 j_1}\, g^{j_1 1} + \cdots + g_{n0}\sum_{j_n} g_{0 j_n}\, g^{j_n n} \\
&= \sum_{p}\sum_{j_r} g_{p0}\, g_{0 j_r}\, g^{j_r p} = \sum_{j_r} g_{0 j_r}\sum_{p} g_{p0}\, g^{j_r p} \\
&= g_{00}.
\end{aligned}
\]

If the activation function satisfies $|\phi'| < 1$ a.e. (namely, its steepest slope
is less than 1 almost everywhere), then

\[
g_{00} = \frac{1}{\sqrt{2\pi}}\int \phi'(|w|\epsilon + b)^2\, e^{-\frac{1}{2}\epsilon^2}\, d\epsilon < \frac{1}{\sqrt{2\pi}}\int e^{-\frac{1}{2}\epsilon^2}\, d\epsilon = 1.
\]

Several activation functions, such as $\tanh(\cdot)$ or the logistic function $\sigma(\cdot)$,
satisfy the aforementioned property. Hence, in these cases, $g_{00} < 1$ and the
convergence condition is satisfied.


14.6 The Fisher Metric Structure of a Neural Net 

Even if we cannot hope for explicit formulas for the Fisher metric in the case
of a feedforward neural network, we can obtain the metric structure
by an iterative method that is similar to the backpropagation method. The
computation is still performed under the assumption that the noise inserted
into the network is standard normal, $n \sim \mathcal{N}(0,1)$. Even if this modeling
assumption seems to be limited, we consider it for the sake of simplicity.
Other types of noise can be considered, but the computation will not run as
smoothly.

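Denote $C(x,y;\theta) = \frac{1}{2}\big(f_\theta(x) - y\big)^2$. Then relation (14.3.12) can be written as

\[
\frac{\partial \ell(x,y;\theta)}{\partial\theta_k} = -\frac{\partial C(x,y;\theta)}{\partial\theta_k}.
\]

This is equivalent to $\nabla_\theta \ell(x,y;\theta) = -\nabla_\theta C(x,y;\theta)$. If we now regard $C(x,y;\theta)$
as a quadratic cost function (even if in this context it has a different
significance), we can compute the gradient $\nabla_\theta C(x,y;\theta)$ by the backpropagation
method presented in Chapter 6. The parameter $\theta_k$ will be replaced by
$w_{ij}^{(\ell)}$ and $b_j^{(\ell)}$, respectively. Following the notations and the computation from
Chapter 6, we obtain

\[
\begin{aligned}
\frac{\partial \ell(x,y;\theta)}{\partial w_{ij}^{(\ell)}} &= -\frac{\partial C(x,y;\theta)}{\partial w_{ij}^{(\ell)}} = -\frac{\partial C(x,y;\theta)}{\partial s_j^{(\ell)}}\frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}} = -\delta_j^{(\ell)} x_i^{(\ell-1)}, \\
\frac{\partial \ell(x,y;\theta)}{\partial b_j^{(\ell)}} &= -\frac{\partial C(x,y;\theta)}{\partial b_j^{(\ell)}} = -\frac{\partial C(x,y;\theta)}{\partial s_j^{(\ell)}}\frac{\partial s_j^{(\ell)}}{\partial b_j^{(\ell)}} = \delta_j^{(\ell)},
\end{aligned}
\]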


5 

















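where $\delta_j^{(\ell)} = \dfrac{\partial C}{\partial s_j^{(\ell)}}$ represents the sensitivity of $C(x,y;\theta)$ with respect to the signal
$s_j^{(\ell)}$ and can be computed using the backpropagation formula (6.2.22)

\[
\delta_i^{(\ell-1)} = \phi'\big(s_i^{(\ell-1)}\big)\sum_{j=1}^{d^{(\ell)}} \delta_j^{(\ell)} w_{ij}^{(\ell)}. \tag{14.6.27}
\]

The delta in the last layer is computed as

\[
\begin{aligned}
\delta_j^{(L)} &= \frac{\partial C}{\partial s_j^{(L)}} = \frac{\partial}{\partial s_j^{(L)}}\,\frac{1}{2}\big(\phi(s_j^{(L)}) - y\big)^2 = \phi'\big(s_j^{(L)}\big)\big(\phi(s_j^{(L)}) - y\big) \\
&= \phi'\big(s_j^{(L)}\big)\big(f_\theta(x) - y\big).
\end{aligned} \tag{14.6.28}
\]

In order to overcome the writing difficulties of the expression of the Fisher
matrix, we use the notations $\alpha = (i, j, \ell)$, $\alpha' = (i', j', \ell')$, $x_0^{(\ell-1)} = -1$, and $w_{0j}^{(\ell)} = b_j^{(\ell)}$.
Then $\theta_\alpha = w_{ij}^{(\ell)}$ and we obtain the metric coefficients

\[
\begin{aligned}
g_{\alpha\alpha'} &= \mathbb{E}^{P_{XY}}\Big[\frac{\partial \ell(x,y;\theta)}{\partial\theta_\alpha}\frac{\partial \ell(x,y;\theta)}{\partial\theta_{\alpha'}}\Big]
= \mathbb{E}^{P_{XY}}\Big[\frac{\partial \ell(x,y;\theta)}{\partial w_{ij}^{(\ell)}}\frac{\partial \ell(x,y;\theta)}{\partial w_{i'j'}^{(\ell')}}\Big] \\
&= \mathbb{E}^{P_{XY}}\Big[x_i^{(\ell-1)} x_{i'}^{(\ell'-1)}\,\delta_j^{(\ell)}\delta_{j'}^{(\ell')}\Big],
\end{aligned}
\]

with the indices in the following ranges

\[
0 \le i, i' \le d^{(\ell-1)}, \qquad 1 \le j, j' \le d^{(\ell)}, \qquad 1 \le \ell \le L,
\]

where $L$ represents the depth of the network. The expressions of the deltas $\delta_j^{(\ell)}$
and $\delta_{j'}^{(\ell')}$ are obtained by the backpropagation formula (14.6.27).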


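If $\ell = \ell' = L$, using (14.6.28) the expression of the metric coefficients
becomes

\[
\begin{aligned}
g_{\alpha\alpha'}\Big|_{\ell = \ell' = L} &= \mathbb{E}^{P_{XY}}\Big[x_i^{(L-1)} x_{i'}^{(L-1)}\,\delta_j^{(L)}\delta_{j'}^{(L)}\Big] \\
&= \mathbb{E}^{P_{XY}}\Big[x_i^{(L-1)} x_{i'}^{(L-1)}\,\phi'\big(s_j^{(L)}\big)\phi'\big(s_{j'}^{(L)}\big)\big(f_\theta(X) - Y\big)^2\Big].
\end{aligned}
\]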








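If the activation function in the last layer is linear, $\phi(x) = x$, the expression
is simpler and can be computed as follows:

\[
\begin{aligned}
g_{\alpha\alpha'}\Big|_{\ell = \ell' = L} &= \mathbb{E}^{P_{XY}}\Big[X_i^{(L-1)} X_{i'}^{(L-1)}\big(f_\theta(X) - Y\big)^2\Big] \\
&= \iint x_i^{(L-1)} x_{i'}^{(L-1)}\big(f_\theta(x) - y\big)^2 p(x,y)\, dx\, dy \\
&= \int x_i^{(L-1)} x_{i'}^{(L-1)}\, p(x) \int \big(f_\theta(x) - y\big)^2 p(y|x;\theta)\, dy\, dx \\
&= \int x_i^{(L-1)} x_{i'}^{(L-1)}\, p(x)\, dx \\
&= \mathbb{E}^{P_X}\Big[X_i^{(L-1)} X_{i'}^{(L-1)}\Big],
\end{aligned}
\]

where we used that

\[
\int \big(f_\theta(x) - y\big)^2 p(y|x;\theta)\, dy = \mathrm{Var}(n) = 1,
\]

namely, the variance of the noise $n \sim \mathcal{N}(0,1)$ is 1. The layer activations
$x_j^{(\ell)}$ are computed iteratively by the forward pass formula

\[
x_j^{(\ell)} = \phi\Big(\sum_i w_{ij}^{(\ell)} x_i^{(\ell-1)} - b_j^{(\ell)}\Big).
\]

Figure 14.3: One-hidden layer neural network with activation function $\phi$ and
input-output mapping $f_\theta(x)$. The activation in the output neuron is linear.

We shall compute explicitly the Fisher metric in the following concrete
case.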


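Example 14.6.1 (Fisher metric for a one-hidden layer network) We
shall consider the case of a feedforward neural network with one hidden layer
and one-dimensional input and output, see Fig. 14.3. The activation function
in the output neuron is linear, while in the hidden layer it is denoted by $\phi$. The
input-output mapping is given by

\[
f_\theta(x) = \sum_{i=1}^{N} \alpha_i\,\phi(w_i x + b_i),
\]

where $N$ is the number of hidden neurons. A straightforward computation provides the partial derivatives with respect
to parameters

\[
\begin{aligned}
\frac{\partial f_\theta(x)}{\partial\alpha_j} &= \phi(w_j x + b_j) \\
\frac{\partial f_\theta(x)}{\partial w_i} &= \alpha_i\, x\,\phi'(w_i x + b_i) \\
\frac{\partial f_\theta(x)}{\partial b_i} &= \alpha_i\,\phi'(w_i x + b_i).
\end{aligned}
\]

The partial derivatives of the log-likelihood function are

\[
\begin{aligned}
\frac{\partial \ell}{\partial\alpha_j} &= -\frac{\partial C}{\partial\alpha_j} = -\frac{1}{2}\frac{\partial}{\partial\alpha_j}\big(f_\theta(x) - y\big)^2 = -\big(f_\theta(x) - y\big)\frac{\partial f_\theta(x)}{\partial\alpha_j} \\
&= -\big(f_\theta(x) - y\big)\phi(w_j x + b_j) \\
\frac{\partial \ell}{\partial w_i} &= -\frac{\partial C}{\partial w_i} = -\big(f_\theta(x) - y\big)\frac{\partial f_\theta(x)}{\partial w_i} = -\alpha_i\, x\big(f_\theta(x) - y\big)\phi'(w_i x + b_i) \\
\frac{\partial \ell}{\partial b_i} &= -\frac{\partial C}{\partial b_i} = -\alpha_i\big(f_\theta(x) - y\big)\phi'(w_i x + b_i).
\end{aligned}
\]

Then the coefficients of the Fisher matrix in the directions $\alpha_j$ can be computed as

\[
\begin{aligned}
g_{\alpha_j\alpha_k} &= \mathbb{E}^{P_{XY}}\Big[\frac{\partial \ell}{\partial\alpha_j}\frac{\partial \ell}{\partial\alpha_k}\Big] = \mathbb{E}^{P_{XY}}\Big[\phi(w_j X + b_j)\phi(w_k X + b_k)\big(f_\theta(X) - Y\big)^2\Big] \\
&= \int \phi(w_j x + b_j)\phi(w_k x + b_k)\, p(x) \int \big(f_\theta(x) - y\big)^2 p(y|x;\theta)\, dy\, dx \\
&= \int \phi(w_j x + b_j)\phi(w_k x + b_k)\, p(x)\, dx \\
&= \mathbb{E}^{P_X}\big[\phi(w_j X + b_j)\phi(w_k X + b_k)\big].
\end{aligned}
\]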



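Similarly,

\[
\begin{aligned}
g_{w_i w_j} &= \mathbb{E}^{P_{XY}}\Big[\frac{\partial \ell}{\partial w_i}\frac{\partial \ell}{\partial w_j}\Big] \\
&= \mathbb{E}^{P_{XY}}\Big[\alpha_i\alpha_j X^2 \phi'(w_i X + b_i)\phi'(w_j X + b_j)\big(f_\theta(X) - Y\big)^2\Big] \\
&= \alpha_i\alpha_j \int x^2 \phi'(w_i x + b_i)\phi'(w_j x + b_j)\, p(x) \underbrace{\int \big(f_\theta(x) - y\big)^2 p(y|x;\theta)\, dy}_{=1}\, dx \\
&= \alpha_i\alpha_j\, \mathbb{E}^{P_X}\big[X^2 \phi'(w_i X + b_i)\phi'(w_j X + b_j)\big].
\end{aligned}
\]

Also, by similar manipulations we obtain

\[
\begin{aligned}
g_{b_i b_j} &= \mathbb{E}^{P_{XY}}\Big[\frac{\partial \ell}{\partial b_i}\frac{\partial \ell}{\partial b_j}\Big] = \alpha_i\alpha_j\, \mathbb{E}^{P_X}\big[\phi'(w_i X + b_i)\phi'(w_j X + b_j)\big] \\
g_{\alpha_j w_i} &= \alpha_i\, \mathbb{E}^{P_X}\big[X \phi(w_j X + b_j)\phi'(w_i X + b_i)\big] \\
g_{\alpha_j b_i} &= \alpha_i\, \mathbb{E}^{P_X}\big[\phi(w_j X + b_j)\phi'(w_i X + b_i)\big] \\
g_{w_i b_k} &= \alpha_i\alpha_k\, \mathbb{E}^{P_X}\big[X \phi'(w_i X + b_i)\phi'(w_k X + b_k)\big].
\end{aligned}
\]

We note that all the above coefficients depend on the input density, $p(x)$, the
activation function of the neurons in the hidden layer, $\phi(x)$, as well as the
parameters of the network.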
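
The coefficients of this example can also be approximated by averaging outer products of the score vector over sampled pairs $(x, y)$. The sketch below is an illustration only: the input law $x \sim \mathcal{N}(0,1)$, the activation $\phi = \tanh$, the parameter values, and the function name `fisher_one_hidden_layer` are assumptions, and the parameters are ordered as $(\alpha, w, b)$.

```python
import numpy as np

def fisher_one_hidden_layer(alpha, w, b, num_samples=200_000, seed=0):
    """Monte Carlo Fisher matrix of f(x) = sum_i alpha_i tanh(w_i x + b_i)
    with y = f(x) + n, n ~ N(0, 1). Assumed input law: x ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_samples)
    pre = np.outer(x, w) + b                      # w_i x + b_i, shape (samples, N)
    phi, dphi = np.tanh(pre), 1.0 - np.tanh(pre) ** 2
    f = phi @ alpha
    y = f + rng.standard_normal(num_samples)
    r = (y - f)[:, None]                          # residual y - f(x)
    # Scores d ell / d theta for theta = (alpha, w, b)
    scores = np.hstack([r * phi, r * alpha * x[:, None] * dphi, r * alpha * dphi])
    g = scores.T @ scores / num_samples
    g_alpha_direct = phi.T @ phi / num_samples    # E[phi_j phi_k] from inputs alone
    return g, g_alpha_direct

alpha = np.array([1.0, -0.5])
w = np.array([0.7, 1.3])
b = np.array([0.1, -0.2])
g, g_alpha_direct = fisher_one_hidden_layer(alpha, w, b)
print(np.round(g[:2, :2], 3))
print(np.round(g_alpha_direct, 3))
```

The upper-left block of the sampled matrix agrees, up to sampling error, with $\mathbb{E}^{P_X}[\phi(w_j X + b_j)\phi(w_k X + b_k)]$, which does not involve the output $y$.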


14.7 The Natural Gradient 

In order to minimize the cost function, $C(w,b)$, which depends on weights
and biases, the gradient descent method was employed. This involved
taking a step $\eta > 0$ into the direction of the negative gradient, $-(\nabla_w C, \nabla_b C)$.
This gradient is computed using the flat geometry of the parameter space $\Theta$
induced by the Euclidean metric $\delta_{ij}$. The idea of this section is to apply the
gradient descent method on the coordinate space $\Theta$ but with a gradient
computed with respect to the Fisher metric. This method is desirable because it
converges faster to the optimal parameter value, $\theta^*$, since the steepest
direction is not captured by the Euclidean gradient, but by its natural gradient,
as it was pointed out in Amari [6]. This section will introduce this concept
and present its main properties.

Let $(\mathcal{S} = \{p(x,y;\theta);\ \theta \in \Theta\}, g)$ be the neuromanifold associated with
a given neural network. The parameter space $\Theta$ can be endowed with the
metric $g(\theta)$ induced from $(\mathcal{S}, g)$ as follows. If $\Big\{\dfrac{\partial}{\partial\theta_i}\Big\}_{1 \le i \le N}$ denote the
coordinate vectors on $\Theta$, then it suffices to define the metric on this basis as

\[
g\Big(\frac{\partial}{\partial\theta_i}, \frac{\partial}{\partial\theta_j}\Big) = g_{ij}(\theta).
\]

Thus, the parameter space together with $g(\theta)$ becomes the Riemannian manifold
$(\Theta, g(\theta))$.


Consider a smooth function defined on the parameter space, $f: \Theta \to \mathbb{R}$
(in particular, this can be any cost function). The multidirectional change of
$f$ is described by its gradient. The Euclidean gradient is the vector field

\[
\nabla_{Eu} f = \sum_{k=1}^{N} \frac{\partial f}{\partial\theta_k}\, e_k = \Big(\frac{\partial f}{\partial\theta_1}, \dots, \frac{\partial f}{\partial\theta_N}\Big)^T,
\]

where $\{e_k\}_k$ is the natural orthonormal basis in $\mathbb{R}^N$. This type of gradient was
very useful in the classical gradient descent method presented in Chapter 4.
However, in the case when $\Theta$ is endowed with the Fisher metric $g(\theta)$, the
gradient of $f$ has to be adjusted accordingly.

The natural gradient of $f$ is the gradient taken with respect to the Fisher
metric $g(\theta)$. This can be written using the basis of coordinate vectors $\dfrac{\partial}{\partial\theta_k}$
as

\[
\nabla_g f = \sum_{k=1}^{N} (\nabla_g f)^k \frac{\partial}{\partial\theta_k}, \tag{14.7.29}
\]

with components given by $(\nabla_g f)^k = \sum_{j=1}^{N} g^{kj}(\theta)\dfrac{\partial f}{\partial\theta_j}$, where $g^{kj}(\theta)$ are the
coefficients of the inverse matrix, $g^{-1}(\theta)$. An equivalent formula for the natural
gradient (14.7.29) in terms of the Euclidean gradient is

\[
\nabla_g f = g(\theta)^{-1}\,\nabla_{Eu} f. \tag{14.7.30}
\]


As an application, the magnitudes of the Euclidean and natural gradients
with respect to the Fisher metric are related by

\[
\begin{aligned}
\|\nabla_g f\|_g^2 &= (\nabla_{Eu} f)^T g^{-1}(\theta)\,\nabla_{Eu} f \\
\|\nabla_{Eu} f\|_g^2 &= (\nabla_{Eu} f)^T g(\theta)\,\nabla_{Eu} f,
\end{aligned}
\]

see Exercise 14.13.12. The multiplication by the matrix $g(\theta)^{-1}$ in formula
(14.7.30) rotates and scales the Euclidean gradient $\nabla_{Eu} f$ to obtain the natural
gradient $\nabla_g f$. Since the gradients $\nabla_g f$ and $\nabla_{Eu} f$ vanish at the same value of
$\theta$, see Exercise 14.13.12, it follows that both variants of the gradient descent
method arrive at the same minimum $f(\theta^*)$ starting from the same initial
point $f(\theta_0)$, but on different paths, see Fig. 14.4. Replacing the Euclidean
gradient by the natural gradient in the gradient descent method improves
the efficiency of the method. The next section deals with applications of this
concept.

Figure 14.4: The natural gradient descent arrives at the minimum faster than
the Euclidean gradient descent method does.
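
In computations, formula (14.7.30) is usually applied by solving a linear system rather than by forming $g^{-1}$ explicitly. The following sketch illustrates this and the first norm identity above; the matrix `g`, the vector `grad_eu`, and the function names are hypothetical values chosen for the example.

```python
import numpy as np

def natural_gradient(g, grad_eu):
    """Natural gradient g^{-1} grad_eu, obtained by solving g v = grad_eu."""
    return np.linalg.solve(g, grad_eu)

def g_norm_sq(g, v):
    """Squared length of the vector v with respect to the metric g."""
    return v @ g @ v

# Hypothetical positive definite metric and Euclidean gradient
g = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])
grad_eu = np.array([1.0, -2.0, 0.5])

grad_nat = natural_gradient(g, grad_eu)
print(g_norm_sq(g, grad_nat))                   # |grad_g f|_g^2
print(grad_eu @ np.linalg.inv(g) @ grad_eu)     # (grad_Eu f)^T g^{-1} grad_Eu f
```

When $g$ is the identity matrix the two gradients coincide, which corresponds to the flat Euclidean geometry.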


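Remark 14.7.1 It is worth noting that the Euclidean gradient of the
log-likelihood function can be used to represent the Fisher matrix $g(\theta) = (g_{ij}(\theta))_{i,j}$
as

\[
g(\theta) = \mathbb{E}^{P_{XY}}\big[\nabla_{Eu}\ell(X,Y;\theta)\,(\nabla_{Eu}\ell(X,Y;\theta))^T\big] = \mathbb{E}^{P_{XY}}\big[\nabla_{Eu}\ell\,(\nabla_{Eu}\ell)^T\big].
\]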




14.8 The Natural Gradient Learning Algorithm 


We have seen in Chapter 4 that the steepest direction of a cost function,
$C(\theta)$, defined on a Euclidean space is given by its Euclidean gradient, $\nabla_{Eu} C$.
This result is not valid in the case of a cost function, $C(\theta)$, which is defined
on a curved space, such as the Riemannian manifold $(\Theta, g)$. The steepest
direction in this case is realized in the natural gradient direction, $\nabla_g C$. This
section deals with the effects of using the natural Riemannian gradient in
neural learning and it is based on the work of Amari et al., see [5], [6], [99],
[130].

















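The steepest descent direction We start by considering a unit vector field, $V = \sum_i V^i(\theta)\frac{\partial}{\partial\theta_i}$, with $\|V\|_g = 1$, tangent to the parameter space $(\Theta, g)$, and investigate the change of the cost function, $C(\theta)$, in the direction of $V$. The rate of change of $C(\theta)$ with respect to $V$ is denoted either by $V(C)$ or by $\frac{\partial C}{\partial V}$, and is equal to

$$\frac{\partial C}{\partial V} = V(C) = \sum_i V^i \frac{\partial C}{\partial\theta_i} = \langle V, \nabla_{Eu} C\rangle. \tag{14.8.31}$$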

By Exercise 14.13.11, part (c), we have 


$$\langle V, \nabla_{Eu} C\rangle = g(V, \nabla_g C).$$


Using the Cauchy-Schwarz inequality together with equation (14.8.31) yields

$$\frac{\partial C}{\partial V} = g(V, \nabla_g C) \le \|V\|_g\,\|\nabla_g C\|_g = \|\nabla_g C\|_g,$$

with equality for the case when $V$ and $\nabla_g C$ are proportional. Therefore, the rate is highest in the direction of $V = \nabla_g C/\|\nabla_g C\|_g$.

Therefore, the steepest descent direction of the cost function $C(\theta)$ is the negative natural gradient, which is given by

$$-\nabla_g C(\theta) = -g^{-1}(\theta)\,\nabla_{Eu} C(\theta).$$

For a proof variant involving Lagrange multipliers see Exercise 14.13.9.

It is worth noting that in the particular case when $(\Theta, g)$ is the Euclidean space, $(\mathbb{R}^n, \delta_{ij})$, then $g^{-1} = I_n$, and hence we retrieve the direction of the Euclidean gradient.

The natural gradient learning algorithm introduced in [6] updates the parameter $\theta_n$ by the rule

$$\theta_{n+1} = \theta_n - \eta_n \nabla_g C(\theta_n), \tag{14.8.32}$$

where the learning rate $\eta_n \to 0$ as $n \to \infty$ in a certain way.
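The update rule (14.8.32) can be sketched numerically. The block below is only an illustration under stated assumptions: the quadratic cost, the matrix `A`, and the choice of a constant metric `g = A` are hypothetical and not taken from the text, and constant learning rates are used instead of a decaying $\eta_n$. Each method is run with a step size that is stable for it.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic cost C(theta) = 0.5 * theta^T A theta,
# whose minimum is at theta = 0.
A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

def grad_eu(theta):
    """Euclidean gradient of the quadratic cost."""
    return A @ theta

# Assumption: a constant metric g equal to A, chosen only for illustration.
g = A.copy()

def descend(theta0, eta, steps, natural):
    """Run gradient descent with either the Euclidean or the natural gradient."""
    theta = theta0.copy()
    for _ in range(steps):
        d = grad_eu(theta)
        if natural:
            # natural gradient g^{-1} grad, obtained by solving g d_nat = d
            d = np.linalg.solve(g, d)
        theta = theta - eta * d
    return theta

theta0 = np.array([1.0, 1.0])
theta_eu = descend(theta0, eta=0.015, steps=50, natural=False)
theta_nat = descend(theta0, eta=0.5, steps=50, natural=True)
err_eu = np.linalg.norm(theta_eu)    # distance to the minimum, Euclidean rule
err_nat = np.linalg.norm(theta_nat)  # distance to the minimum, natural rule
print(err_eu, err_nat)
```

In this toy setting the metric removes the ill-conditioning, so the natural rule tolerates a much larger step and contracts all directions at the same rate.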

It was suggested in [130] that replacing $\nabla_{Eu} C$ by $\nabla_g C$ helps eliminate situations in which the iteration is trapped in a plateau. There are also other reasons why natural gradient learning is more efficient than the usual gradient descent. Before getting to them, we first recall two types of learning algorithms, batch and online learning.

Batch learning In this case all training examples in a batch are used to obtain the optimal weight vector. If the training set is $\{(x_1, z_1), \dots, (x_n, z_n)\}$, then the cost function depends on all samples as

$$C(\theta) = \frac{1}{2n}\sum_{j=1}^{n} |z_j - f_\theta(x_j)|^2.$$















If data is sampled from the same training distribution, $p_{XZ}(\theta)$, the cost can also be written as an expectation

$$C(\theta) = \frac{1}{2} E_{p_{XZ}(\theta)}\big[(Z - f_\theta(X))^2\big].$$

The regular gradient descent method in this case is described by taking steps as

$$\theta_{n+1} = \theta_n - \eta_n \nabla C(\theta_n), \qquad n = 0, 1, 2, \dots$$


Online learning This uses each example only once, at the observation time, assuming that the examples are given one at a time. The cost function takes the simple form

$$C(x_n, y_n; \theta_n) = \frac{1}{2}\,|y_n - f_{\theta_n}(x_n)|^2.$$

The gradient descent method proposed in [3] and [107] employs the rule

$$\theta_{n+1} = \theta_n - \eta_n \nabla C(x_n, y_n; \theta_n).$$
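The two update schemes can be compared on a toy problem. The sketch below assumes a hypothetical linear model $f_\theta(x) = \theta^T x$ with synthetic data; the sample size, noise level, and the schedule $\eta_k = 1/(k+20)$ are illustrative choices, not values from the text, and plain Euclidean gradients are used in both loops.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])      # hypothetical target parameter
n = 2000
X = rng.standard_normal((n, 2))
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Batch learning: every step uses all n samples.
theta_batch = np.zeros(2)
for _ in range(200):
    residual = y - X @ theta_batch
    grad = -(X.T @ residual) / n        # gradient of (1/2n) * sum |y_j - f(x_j)|^2
    theta_batch -= 0.1 * grad

# Online learning: each sample is used exactly once, with a decaying rate.
theta_online = np.zeros(2)
for k in range(n):
    eta_k = 1.0 / (k + 20)
    residual_k = y[k] - X[k] @ theta_online
    grad_k = -residual_k * X[k]         # gradient of (1/2) |y_k - f(x_k)|^2
    theta_online -= eta_k * grad_k

err_batch = np.linalg.norm(theta_batch - theta_true)
err_online = np.linalg.norm(theta_online - theta_true)
print(err_batch, err_online)
```

Both estimates approach the target, with the single-pass online estimate typically somewhat less accurate than the batch one.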


In general, the convergence of $\theta_n$ to the true minimum, $\theta^*$, of the cost function is more accurate in the case of batch learning than in the case of online learning. However, if the learning rate, $\eta_n$, converges to 0 in a certain manner, and the Euclidean gradient, $\nabla C(x_n, y_n; \theta_n)$, is replaced by the natural gradient, then online learning becomes asymptotically as efficient as batch learning. In order to present this idea further, we shall first introduce a few notions regarding estimators.


Types of estimators Let $S = \{p_\theta;\ \theta \in \Theta\}$ be a family of densities and $\hat\theta = \hat\theta(x_1, \dots, x_n)$ be an estimator of the parameter $\theta$ based on data $\{x_1, \dots, x_n\}$ sampled from the distribution $p_\theta$. Then:

• $\hat\theta$ is called unbiased if $E[\hat\theta] = \theta$.

• $\hat\theta_n = \hat\theta(x_1, \dots, x_n)$ is called consistent if $\hat\theta_n \to \theta$ in probability, i.e., $\lim_{n\to\infty} P(|\hat\theta_n - \theta| < \epsilon) = 1$, for any $\epsilon > 0$. See Appendix, section D.6.1, for the definition of convergence in probability.

• $\hat\theta$ is Fisher-efficient if it is unbiased and reaches the lower bound in the Cramér-Rao inequality²

$$\mathrm{Cov}(\hat\theta) \ge g^{-1}(\theta), \qquad \forall \theta \in \Theta,$$

i.e., it is a minimum variance unbiased estimator.

• $\hat\theta_n$ is asymptotically Fisher-efficient if it attains equality in the Cramér-Rao bound asymptotically, i.e.,

$$\lim_{n\to\infty} \mathrm{Cov}(\hat\theta_n) = g^{-1}(\theta), \qquad \forall \theta \in \Theta.$$

² If $A$ and $B$ are two square matrices, we write $A \ge B$ if $A - B$ is positive semidefinite, i.e., all its eigenvalues are nonnegative.

For instance, in a correctly specified model, a well-known result states that the maximum likelihood estimator, $\hat\theta_{MLE,N}$, depending on $N$ independent samples $x_j$,

$$\hat\theta_{MLE,N} = \arg\min_{\theta\in\Theta}\ -\frac{1}{N}\sum_{j=1}^{N} \ln p_\theta(x_j) = \arg\max_{\theta\in\Theta}\ \prod_{j=1}^{N} p_\theta(x_j),$$

is both consistent ($\hat\theta_{MLE,N} \to \theta$, as $N \to \infty$, in probability) and asymptotically efficient (the Cramér-Rao lower bound is reached when the sample size, $N$, tends to infinity). For other examples, see Exercises 14.13.13 and 14.13.14.

In this case the Cramér-Rao inequality can be written as

$$E\big[(\hat\theta_{MLE,N} - \theta)(\hat\theta_{MLE,N} - \theta)^T\big] \ge \frac{1}{N}\, g^{-1}(\theta),$$

see Exercise 14.13.16. The fact that the maximum likelihood estimator is asymptotically Fisher-efficient can be written as

$$\lim_{N\to\infty} N\,E\big[(\hat\theta_{MLE,N} - \theta)(\hat\theta_{MLE,N} - \theta)^T\big] = g^{-1}(\theta),$$

see also Exercise 14.13.16.
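The asymptotic relation above can be checked by simulation on a simple model. The sketch below assumes a hypothetical exponential family with rate `lam` (not discussed in the text), for which the maximum likelihood estimator is the reciprocal of the sample mean and the Fisher information is $g(\lambda) = 1/\lambda^2$; the sample size and the number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0          # true rate of the hypothetical exponential model
N = 400            # sample size per experiment
trials = 5000      # number of repeated experiments

# One row per experiment; numpy parametrizes the exponential by scale = 1/rate.
samples = rng.exponential(scale=1.0 / lam, size=(trials, N))
lam_mle = 1.0 / samples.mean(axis=1)            # MLE of the rate in each experiment

scaled_mse = N * np.mean((lam_mle - lam) ** 2)  # Monte Carlo estimate of N * E[(mle - lam)^2]
fisher_inverse = lam ** 2                       # g(lam)^{-1} for the exponential model
print(scaled_mse, fisher_inverse)
```

For large `N` the scaled mean squared error should be close to the inverse Fisher information, up to Monte Carlo error and a finite-sample bias of order $1/N$.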

Fisher efficiency in online learning Since in online learning training examples are used only once, as they appear, one would expect its asymptotic performance to be worse than that of the optimal batch procedure, in which all examples are reused for several epochs. The next result, which can be found in Amari [6], states that the efficiency actually holds, provided some extra conditions are satisfied.


Theorem 14.8.1 Let the cost function be the log-likelihood function, $C(x, z; \theta) = \ln p(x, z; \theta)$. Let $\theta^*$ represent the true value of the parameter of the distribution from which the data are sampled, i.e., $(x_n, z_n) \sim p(x, z; \theta^*)$. Then the natural gradient learning rule for the online learning

$$\hat\theta_{n+1} = \hat\theta_n - \frac{1}{n}\,\nabla_g C(x_n, z_n; \hat\theta_n)$$

produces an estimator $\hat\theta_n$, which is asymptotically Fisher-efficient, i.e.,

$$\lim_{n\to\infty} n\,E\big[(\hat\theta_n - \theta^*)(\hat\theta_n - \theta^*)^T\big] = g(\theta^*)^{-1}.$$




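The proof idea is to consider the covariance matrix

$$V_n = E\big[(\hat\theta_n - \theta^*)(\hat\theta_n - \theta^*)^T\big]$$

and show that it verifies the asymptotic relation

$$V_n = \frac{1}{n}\, g(\theta^*)^{-1} + O\Big(\frac{1}{n^2}\Big).$$

This is obtained by subtracting $\theta^*$ from both sides of the online learning relation

$$\hat\theta_{n+1} = \hat\theta_n - \frac{1}{n}\,\nabla_g \ell(x_n, z_n; \hat\theta_n)$$

and then taking the expectation of the square of both sides. The computation involves the linear approximation of the derivative of the log-likelihood function

$$\nabla_\theta \ell(x_n, z_n; \hat\theta_n) = \nabla_\theta \ell(x_n, z_n; \theta^*) + (\hat\theta_n - \theta^*)^T \nabla_\theta \nabla_\theta \ell(x_n, z_n; \theta^*) + O(\|\hat\theta_n - \theta^*\|^2),$$

as well as the relations

$$E[\nabla_\theta \ell(x, y; \theta^*)] = 0, \qquad E[\nabla_\theta \nabla_\theta \ell(x, y; \theta^*)] = -g(\theta^*),$$

together with an expansion of $g(\hat\theta_n)$ about $\theta^*$.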


Adaptive implementation The natural gradient algorithm requires that the inverse of the Fisher metric, $g(\theta)^{-1}$, be known, which is rarely available in closed form. An adaptive method for directly estimating the inverse $g(\theta)^{-1}$ and applying the natural gradient online learning is given in [8]:

$$g_{n+1}^{-1} = (1 + \epsilon_n)\, g_n^{-1} - \epsilon_n\, g_n^{-1}\, \nabla_{Eu} f_n\, (\nabla_{Eu} f_n)^T g_n^{-1}$$
$$\theta_{n+1} = \theta_n - \eta_n\, g_n^{-1}\, \nabla \ell(x_n, z_n; \theta_n),$$

where $f_n = f_\theta(x_n)$ is the input-output mapping, $g_n = g(\theta_n)$, and $\epsilon_n > 0$ is a small learning rate.
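The first recursion can be tried in isolation. In the sketch below the gradients $\nabla_{Eu} f_n$ are replaced by random vectors `v` with a known second-moment matrix `M`, which is an assumption made only so that the limit can be checked; the schedule $\epsilon_n = 1/(n+100)$ and the matrix `M` are likewise hypothetical. Under these assumptions the iterates should approach $M^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # hypothetical second-moment matrix E[v v^T]
L = np.linalg.cholesky(M)

g_inv = np.eye(2)                     # initial guess for the inverse metric
for n in range(20000):
    v = L @ rng.standard_normal(2)    # stand-in for grad f_n, with E[v v^T] = M
    eps = 1.0 / (n + 100)             # small, decaying learning rate
    gv = g_inv @ v
    # (1 + eps) g^{-1} - eps g^{-1} v v^T g^{-1}; g_inv stays symmetric,
    # so g^{-1} v v^T g^{-1} equals the outer product of gv with itself
    g_inv = (1.0 + eps) * g_inv - eps * np.outer(gv, gv)

target = np.linalg.inv(M)
print(g_inv)
print(target)
```

The recursion never inverts a matrix: to first order in $\epsilon_n$ it is the inverse of the running average $(1-\epsilon_n)\,g_n + \epsilon_n\, v v^T$.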


14.9 Log-likelihood and the Metric 

This section relates the magnitude of the change in the log-likelihood function to the Euclidean gradient of the input-output mapping.









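If the parameters of a neural network are perturbed infinitesimally from $\theta$ to $\theta' = \theta + d\theta$, then the input-output mapping changes from $f_\theta(x)$ to $f_{\theta'}(x)$, where

$$f_{\theta'}(x) = f_\theta(x) + \sum_{k=1}^{N} \frac{\partial f_\theta(x)}{\partial \theta_k}\, d\theta_k = f_\theta(x) + \langle \nabla_{Eu} f, d\theta \rangle \tag{14.9.33}$$

with the infinitesimal perturbation vector $d\theta = (d\theta_1, \dots, d\theta_N)^T$.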

The square of the distance between the infinitesimally separated points $\theta$ and $\theta'$ in the parameter space $\Theta$ with respect to the metric $g$ is given by the quadratic form

$$\|\theta' - \theta\|_g^2 = (d\theta)^T g(\theta)\, d\theta = \sum_{i,j} g_{ij}(\theta)\, d\theta_i\, d\theta_j.$$

We note the similarity with the Euclidean distance, $\|d\theta\|_{Eu}^2 = \sum_j (d\theta_j)^2$. The infinitesimal change of parameters has an effect on the change of the log-likelihood function. This is given by the next result.

Proposition 14.9.1 (a) The infinitesimal change in the log-likelihood function is

$$d\ell(x, y; \theta) = (y - f_\theta(x))\, df_\theta(x);$$

(b) The square of the magnitude is given by

$$\|d\ell(x, y; \theta)\|_g^2 = (y - f_\theta(x))^2\, (\nabla_{Eu} f_\theta(x))^T g(\theta)\, \nabla_{Eu} f_\theta(x)\; O(\|d\theta\|^2).$$

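Proof: (a) For $\theta' = \theta + d\theta$, the change in the log-likelihood function $\ell(\theta) = \ell(x, y; \theta)$ is

$$\ell(\theta') - \ell(\theta) = \sum_k \frac{\partial \ell}{\partial \theta_k}\, d\theta_k = \sum_k (y - f_\theta(x)) \frac{\partial f_\theta(x)}{\partial \theta_k}\, d\theta_k = (y - f_\theta(x)) \sum_k \frac{\partial f_\theta(x)}{\partial \theta_k}\, d\theta_k = (y - f_\theta(x))(f_{\theta'}(x) - f_\theta(x)),$$

where we used formulas (14.3.12) and (14.9.33). Substituting now $d\ell(x, y; \theta) = \ell(\theta') - \ell(\theta)$ and $df_\theta(x) = f_{\theta'}(x) - f_\theta(x)$, we obtain the desired formula.

(b) Taking the square of the magnitude in the $g$-metric in the relation from part (a), we have

$$\|d\ell(x, y; \theta)\|_g^2 = (y - f_\theta(x))^2\, \|df_\theta(x)\|_g^2.$$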












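The second factor in the right side can be evaluated as

$$\|df_\theta(x)\|_g^2 = (df_\theta(x))^T g(\theta)\, df_\theta(x) = \sum_k \frac{\partial f_\theta}{\partial \theta_k} (d\theta_k)^T\, g(\theta) \sum_j \frac{\partial f_\theta}{\partial \theta_j}\, d\theta_j$$
$$= \sum_{j,k} \frac{\partial f_\theta}{\partial \theta_k} \frac{\partial f_\theta}{\partial \theta_j}\, (d\theta_k)^T g(\theta)\, d\theta_j = \sum_{j,k} \frac{\partial f_\theta}{\partial \theta_k} \frac{\partial f_\theta}{\partial \theta_j}\, g_{jk}(\theta)\, \|d\theta_k\|\, \|d\theta_j\|$$
$$= g(\nabla_{Eu} f_\theta, \nabla_{Eu} f_\theta)\, O(\|d\theta\|^2) = \|\nabla_{Eu} f_\theta\|_g^2\, O(\|d\theta\|^2).$$

We have used the formula $(d\theta_k)^T g(\theta)\, d\theta_j = g_{jk}(\theta)\, O(\|d\theta\|^2)$, which follows from the linear algebra relation $e_j^T A e_k = A_{jk}$, where $A$ is a matrix, $A_{jk}$ the $(j,k)$th entry and $\{e_k\}$ an orthonormal basis; in our case, $A = g(\theta)$ and $e_k = d\theta_k/\|d\theta_k\|$. Expressing $\|\nabla_{Eu} f_\theta\|_g^2$ as in Exercise 14.13.12, part (b), we obtain the desired relation. ■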


14.10 Relation to the Kullback-Leibler Divergence 


It is known that the proximity between two probability densities, $p(x, y; \theta)$ and $p(x, y; \theta')$, on the neuromanifold $S$ associated to a neural net can be measured using the Kullback-Leibler divergence. This section shows the relation between this proximity and the Riemannian distance between $\theta$ and $\theta' = \theta + d\theta$ in the parameter space $(\Theta, g(\theta))$, see Fig. 14.5.

The following result will be useful shortly.

Lemma 14.10.1 If $\ell(x, y; \theta)$ denotes the log-likelihood function, we have

$$E_{p_{XY}(\theta)}\Big[\frac{\partial}{\partial \theta_j}\, \ell(X, Y; \theta)\Big] = 0.$$



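Proof: Using the definition of the expectation and log-likelihood function, we have

$$E_{p_{XY}(\theta)}\Big[\frac{\partial}{\partial \theta_j}\, \ell(X, Y; \theta)\Big] = \iint \frac{\partial}{\partial \theta_j}\, \ell(x, y; \theta)\, p(x, y; \theta)\, dx\,dy = \iint \frac{\partial}{\partial \theta_j}\, p(x, y; \theta)\, dx\,dy$$
$$= \frac{\partial}{\partial \theta_j} \iint p(x, y; \theta)\, dx\,dy = \frac{\partial}{\partial \theta_j}(1) = 0.$$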


The previous result can be used to write the Fisher information in the covariance form:
























Figure 14.5: The Riemannian distance between $\theta$ and $\theta'$ is related to the Kullback-Leibler divergence of $p_\theta$ and $p_{\theta'}$.

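Corollary 14.10.2 The Fisher matrix is given by the covariance matrix

$$g_{ij}(\theta) = \mathrm{Cov}(\partial_{\theta_i} \ell,\ \partial_{\theta_j} \ell), \tag{14.10.34}$$

where $\partial_{\theta_i} \ell = \frac{\partial}{\partial \theta_i}\, \ell(X, Y; \theta)$.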


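Proof: Using Lemma 14.10.1 and the covariance definition, we have

$$\mathrm{Cov}(\partial_{\theta_i} \ell,\ \partial_{\theta_j} \ell) = E_{p_{XY}(\theta)}\Big[\frac{\partial}{\partial \theta_i}\ell(X, Y; \theta)\, \frac{\partial}{\partial \theta_j}\ell(X, Y; \theta)\Big] - \underbrace{E_{p_{XY}(\theta)}\Big[\frac{\partial}{\partial \theta_i}\ell(X, Y; \theta)\Big]}_{=0}\ \underbrace{E_{p_{XY}(\theta)}\Big[\frac{\partial}{\partial \theta_j}\ell(X, Y; \theta)\Big]}_{=0} = g_{ij}(\theta).$$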
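Lemma 14.10.1 and Corollary 14.10.2 can be illustrated by Monte Carlo on a simple family. The sketch below uses a univariate normal density with parameters $(\mu, \sigma)$ as a stand-in for the network density $p(x, y; \theta)$; this family, the parameter values, and the sample size are assumptions for illustration only. Its Fisher matrix in these coordinates is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score components: partial derivatives of ln p(x; mu, sigma) for the normal density.
score_mu = (x - mu) / sigma**2
score_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
scores = np.stack([score_mu, score_sigma])      # shape (2, n), one row per parameter

score_mean = scores.mean(axis=1)                # Lemma 14.10.1: should be close to 0
fisher_mc = np.cov(scores)                      # Corollary 14.10.2: covariance of the score
fisher_exact = np.diag([1 / sigma**2, 2 / sigma**2])
print(score_mean)
print(fisher_mc)
```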


Denote, for simplicity, $p_\theta = p(x, y; \theta)$ and consider $\theta' = \theta + d\theta$. The next result shows that the proximity between $p_\theta$ and $p_{\theta'}$ measured by the Kullback-Leibler divergence is, to second order, half the squared Riemannian distance between $\theta$ and $\theta'$ in the space $(\Theta, g)$.



























Proposition 14.10.3 The linear and quadratic approximations of the Kullback-Leibler divergence are given by:

(a) $D_{KL}(p_\theta \| p_{\theta'}) = O(\|d\theta\|^2)$;

(b) $D_{KL}(p_\theta \| p_{\theta'}) = \frac{1}{2}\|d\theta\|_g^2 + O(\|d\theta\|^3)$.

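Proof: (a) Let $F_\theta : \mathbb{R}^N \to [0, \infty)$, given by $F_\theta(u) = D_{KL}(p_\theta \| p_{\theta+u})$. From divergence properties, $F_\theta(0) = D_{KL}(p_\theta \| p_\theta) = 0$. In order to compute the partial derivative, we consider the variation in the $e_j$-direction. Using the definition of the derivative as a limit, we have

$$\frac{\partial}{\partial u_j} F_\theta(0) = \lim_{t\to 0} \frac{F_\theta(t e_j) - F_\theta(0)}{t} = \lim_{t\to 0} \frac{D_{KL}(p_\theta \| p_{\theta + t e_j})}{t} = \lim_{t\to 0} \frac{1}{t}\, E_{p_\theta}\big[\ell(\theta) - \ell(\theta + t e_j)\big]$$
$$= -\lim_{t\to 0} E_{p_\theta}\Big[\frac{\ell(\theta + t e_j) - \ell(\theta)}{t}\Big] = -E_{p_\theta}\Big[\lim_{t\to 0} \frac{\ell(\theta + t e_j) - \ell(\theta)}{t}\Big] = -E_{p_\theta}\Big[\frac{\partial}{\partial \theta_j}\ell(\theta)\Big] = 0,$$

where the last identity is provided by Lemma 14.10.1.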

Since the first two terms of the right side of the linear approximation

$$F_\theta(u) = F_\theta(0) + \sum_j \frac{\partial}{\partial u_j} F_\theta(0)\, du_j + O(\|du\|^2)$$

are zero, we obtain $F_\theta(u) = O(\|du\|^2)$, which is equivalent to

$$D_{KL}(p_\theta \| p_{\theta'}) = O(\|d\theta\|^2).$$


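(b) Similarly, taking second partial derivatives we obtain

$$\frac{\partial^2}{\partial u_i\, \partial u_j} F_\theta(u) = \frac{\partial^2}{\partial u_i\, \partial u_j} D_{KL}(p_\theta \| p_{\theta+u}) = \frac{\partial^2}{\partial u_i\, \partial u_j} E_{p_\theta}\big[\ell(\theta) - \ell(\theta + u)\big] = -E_{p_\theta}\Big[\frac{\partial^2}{\partial u_i\, \partial u_j}\, \ell(\theta + u)\Big].$$

Using the definition of the Fisher metric coefficients given by (14.2.8), we have

$$\frac{\partial^2}{\partial u_i\, \partial u_j} F_\theta(u)\Big|_{u=0} = -E_{p_\theta}\Big[\frac{\partial^2}{\partial u_i\, \partial u_j}\, \ell(\theta + u)\Big]\Big|_{u=0} = -E_{p_\theta}\Big[\frac{\partial^2}{\partial \theta_i\, \partial \theta_j}\, \ell(\theta)\Big] = g_{ij}(\theta).$$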







































We have assumed that the derivatives commute with the expectation operator, a fact that always holds for densities of Gaussian type.

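The quadratic approximation

$$F_\theta(u) = F_\theta(0) + \sum_j \frac{\partial}{\partial u_j} F_\theta(0)\, du_j + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial u_i\, \partial u_j} F_\theta(0)\, du_i\, du_j + O(\|du\|^3) = \frac{1}{2} \sum_{i,j} g_{ij}(\theta)\, du_i\, du_j + O(\|du\|^3)$$

can be written as

$$D_{KL}(p_\theta \| p_{\theta'}) = \frac{1}{2} \sum_{i,j} g_{ij}(\theta)\, d\theta_i\, d\theta_j + O(\|d\theta\|^3) = \frac{1}{2}(d\theta)^T g(\theta)\, d\theta + O(\|d\theta\|^3) = \frac{1}{2}\|d\theta\|_g^2 + O(\|d\theta\|^3).$$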
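The quadratic approximation in part (b) can be checked on a family where the divergence has a closed form. The sketch below assumes a univariate normal family in the coordinates $(\mu, \sigma)$, with Fisher matrix $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$; the family and the perturbation `d` are illustrative choices and not part of the text.

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form D_KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

mu, sigma = 0.0, 2.0
g = np.diag([1 / sigma**2, 2 / sigma**2])   # Fisher matrix in (mu, sigma) coordinates

def compare(dtheta):
    """Return the exact divergence and its quadratic approximation 0.5 * dtheta^T g dtheta."""
    exact = kl_normal(mu, sigma, mu + dtheta[0], sigma + dtheta[1])
    quad = 0.5 * dtheta @ g @ dtheta
    return exact, quad

d = np.array([0.01, 0.02])
exact, quad = compare(d)
exact_half, quad_half = compare(d / 2)
err = abs(exact - quad)                      # remainder for the perturbation d
err_half = abs(exact_half - quad_half)       # remainder for the perturbation d / 2
print(exact, quad, err, err_half)
```

Halving the perturbation should reduce the remainder by roughly a factor of eight, in line with the $O(\|d\theta\|^3)$ term.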


Some concepts of Differential Geometry defined on the Riemannian manifold $(\Theta, g)$ can be expressed in terms of statistical concepts on the neuromanifold $S$. We shall do this for the energy and length of a curve.


Let $\theta : [a, b] \to \Theta$ be a differentiable curve in the parameter space $\Theta$, endowed with the Fisher metric $g(\theta)$. The energy of the curve is the integral of the kinetic energy density along the curve

$$E(\theta) = \frac{1}{2} \int_a^b \|\dot\theta(t)\|_g^2\, dt.$$

We shall provide a quantitative characterization of the energy in terms of the Kullback-Leibler divergence of the probability density $p_{\theta(t)}$. Note that the assignment $t \to p_{\theta(t)}$ is a curve on the neuromanifold $S$.

We consider an equidistant partition $a = t_0 < t_1 < \cdots < t_n = b$, with $\Delta t = t_{k+1} - t_k = (b - a)/n$, and denote $\theta_k = \theta(t_k)$. The Riemannian distance between the points $\theta_k$ and $\theta_{k+1}$, for $n$ large, can be expressed by Proposition 14.10.3, part (b), as

$$\frac{1}{2}\|\theta_{k+1} - \theta_k\|_g^2 \approx D_{KL}(p_{\theta_k} \| p_{\theta_{k+1}}).$$






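Using this, we can evaluate the energy as

$$E(\theta) = \frac{1}{2}\int_a^b \|\dot\theta(t)\|_g^2\, dt = \lim_{n\to\infty} \frac{1}{2} \sum_{k=0}^{n-1} \frac{\|\theta_{k+1} - \theta_k\|_g^2}{(\Delta t)^2}\, \Delta t = \lim_{n\to\infty} \sum_{k=0}^{n-1} \frac{1}{\Delta t}\, D_{KL}(p_{\theta_k} \| p_{\theta_{k+1}}) = \lim_{n\to\infty} \frac{n}{b - a} \sum_{k=0}^{n-1} D_{KL}(p_{\theta_k} \| p_{\theta_{k+1}}).$$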


A similar problem regarding deformation of oval curves has been recently 
asked in [21]. 

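The length of the curve $\theta(t)$ is obtained by integrating the speed $\|\dot\theta(t)\|_g$ along the curve with respect to the time parameter $t$ as

$$L(\theta) = \int_a^b \|\dot\theta(t)\|_g\, dt.$$

The length can be expressed in terms of the Kullback-Leibler divergence as follows:

$$L(\theta) = \int_a^b \|\dot\theta(t)\|_g\, dt = \lim_{n\to\infty} \sum_{k=0}^{n-1} \frac{\|\theta_{k+1} - \theta_k\|_g}{\Delta t}\, \Delta t = \lim_{n\to\infty} \sum_{k=0}^{n-1} \|\theta_{k+1} - \theta_k\|_g = \sqrt{2}\, \lim_{n\to\infty} \sum_{k=0}^{n-1} \sqrt{D_{KL}(p_{\theta_k} \| p_{\theta_{k+1}})},$$

where the last identity has used Proposition 14.10.3.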
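Both limits can be approximated numerically for a concrete curve. The sketch below assumes a univariate normal family and the hypothetical curve $\theta(t) = (\mu(t), \sigma(t)) = (t, 1 + t)$ on $[0, 1]$; for this choice the squared speed in the Fisher metric is $3/(1+t)^2$, so the energy is $3/4$ and the length is $\sqrt{3}\ln 2$. None of these choices come from the text.

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form D_KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

a, b, n = 0.0, 1.0, 2000
t = np.linspace(a, b, n + 1)     # equidistant partition t_0 < ... < t_n
mu_t = t                         # hypothetical curve theta(t) = (mu(t), sigma(t))
sigma_t = 1.0 + t

# Divergences between consecutive points theta_k and theta_{k+1}, k = 0..n-1.
kl = kl_normal(mu_t[:-1], sigma_t[:-1], mu_t[1:], sigma_t[1:])

energy_kl = n / (b - a) * kl.sum()               # finite-n version of the energy formula
length_kl = np.sqrt(2.0) * np.sqrt(kl).sum()     # finite-n version of the length formula

# Closed forms for this curve: speed^2 = (mu'^2 + 2 sigma'^2) / sigma^2 = 3 / (1 + t)^2.
energy_exact = 0.75
length_exact = np.sqrt(3.0) * np.log(2.0)
print(energy_kl, energy_exact, length_kl, length_exact)
```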


14.11 Simulated Annealing Method 

In the previous sections we have added a Gaussian noise, $n \sim \mathcal{N}(0, 1)$, to the output of a neural network, see (14.1.1), and then we approached the problem by techniques of Information Geometry. We have seen that learning is performed by the natural gradient algorithm, which involves the inverse of the Fisher metric. In this section we shall use the previous results to make a connection with the simulated annealing method.

The regular gradient descent method applied to a deep neural network with the output $Y = f_\theta(X)$ provides in most cases, due to the high nonlinearity of $f_\theta$, only local minima of the cost function. In order to obtain a global minimum, a variant of the simulated annealing method will be used. For this we shall consider an adjustable noise, $n_T \sim \mathcal{N}(0, T^2)$, where $T$ plays the role of temperature.












Figure 14.6: Annealing method: a. For a large temperature, $T_1$, the optimal parameter $\theta_1^*$ is located in a neighborhood of the global minimum. b. Decreasing the temperature to $T_2$, we obtain a more accurate approximation of the global minimum given by the new optimum value, $\theta_2^*$. c. Continuing to decrease the temperature, we obtain more and more accurate approximations of the global minimum.



The heuristic idea is to start optimizing the cost function from a large temperature, $T$, and then decrease it to zero, according to a certain schedule. If the schedule is $T_1 > T_2 > \cdots > T_N > 0$, we denote by $\theta_1^*$ the optimal parameter corresponding to temperature $T_1$, obtained by the natural gradient learning method. The search for the next optimal parameter value, $\theta_2^*$, which corresponds to temperature $T_2$, starts from $\theta_1^*$, see Fig. 14.6. In general, the optimal value $\theta_k^*$, corresponding to the temperature $T_k$, is obtained by the natural gradient descent, which starts the search from the initial value $\theta_{k-1}^*$. The last optimal value, $\theta_N^*$, corresponding to the lowest temperature, $T_N$, is the closest to the true global minimum of the cost function.
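The warm-started schedule can be sketched on a one-dimensional toy problem. The block below makes several simplifying assumptions that differ from the setting of the text: the cost $C(\theta) = 0.05\,\theta^2 - \cos(3\theta)$ is hypothetical, the temperature enters by Gaussian smoothing of the parameter (not by noise on the network output), which gives the closed-form smoothed cost $C_T(\theta) = 0.05(\theta^2 + T^2) - e^{-4.5T^2}\cos(3\theta)$, the final stage uses $T = 0$, and ordinary gradient descent replaces the natural gradient.

```python
import numpy as np

def grad_cost(theta, T):
    """Gradient of the smoothed cost C_T(theta) = E[C(theta + T * xi)], xi ~ N(0, 1),
    for the hypothetical cost C(theta) = 0.05 * theta**2 - cos(3 * theta)."""
    return 0.1 * theta + 3.0 * np.exp(-4.5 * T**2) * np.sin(3.0 * theta)

def descend(theta, T, eta=0.05, steps=2000):
    """Plain gradient descent at a fixed temperature T."""
    for _ in range(steps):
        theta = theta - eta * grad_cost(theta, T)
    return theta

theta0 = 4.0

# No annealing: descent at T = 0 gets trapped in the nearest local minimum.
theta_plain = descend(theta0, T=0.0)

# Annealing: decrease T along a schedule, warm-starting each stage
# from the optimum found at the previous temperature.
theta_annealed = theta0
for T in [1.5, 1.0, 0.7, 0.4, 0.2, 0.0]:
    theta_annealed = descend(theta_annealed, T)

print(theta_plain, theta_annealed)
```

At high temperature the oscillating term is almost averaged out and the smoothed cost has a single minimum; each later stage refines the previous optimum instead of restarting from `theta0`.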

14.12 Summary 

This chapter provides an introduction to the Information Geometry of neural networks. Such networks are noisy and the output is characterized by a probability density parametrized by weights and biases. Thus, each distribution can be considered as a point in a space, which becomes a Riemannian manifold when endowed with the Fisher metric. This is the neuromanifold associated with the given network.

The topics covered here deal mainly with the intrinsic geometry of a neuromanifold, which is defined by the Fisher information metric. This metric is computed explicitly for a few particular types of networks and is applied to the natural gradient learning algorithm, which is an adapted version of the gradient descent algorithm for Riemannian manifolds. Inserting noise in a neural network is like increasing the temperature of a thermodynamic system. A variant of the simulated annealing method works in combination with the natural gradient descent method in order to obtain the global minimum of the cost function.

There are several important topics of Information Geometry which are left out of this chapter, such as the extrinsic geometry of the neuromanifold, which describes the relative geometry of a network with respect to a larger manifold of probability densities. Topics such as embedded curvature, dual connections, etc., can be found by the interested reader in Amari [4] or Calin et al. [22]. For more applications of Information Geometry to Machine Learning, the reader is referred to [7].


14.13 Exercises 


Exercise 14.13.1 Let $\{v_1, \dots, v_n\}$ be an orthonormal basis in $\mathbb{R}^n$ (i.e., a set of $n$ vectors such that $v_i^T v_j = \delta_{ij}$).

(a) If $G$ is an $n \times n$ symmetric matrix such that $v_i^T G v_j = 0$, for all $1 \le i, j \le n$, show that $G = O_n$ (the $n$-dimensional zero matrix).

(b) If $A$ and $B$ are two $n \times n$ symmetric matrices such that $v_i^T A v_j = v_i^T B v_j$, for all $1 \le i, j \le n$, show that $A = B$.


Exercise 14.13.2 A $2 \times 2$ matrix is said to be a rotation of angle $\phi$ if it has the form

$$R = \begin{pmatrix} u_1 & u_2 \\ v_1 & v_2 \end{pmatrix} = \begin{pmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{pmatrix}.$$

We note that $u$ and $v$ are orthonormal vectors, and $\det R = 1$.

(a) Let $X = (X_1, X_2)^T \sim \mathcal{N}(0, I_2)$ and consider the rotation matrix $R$ as above. Show that $u^T X$ and $v^T X$ are independent, where $u^T = (u_1, u_2)$ and $v^T = (v_1, v_2)$.

(b) What are the distributions of $u^T X$ and $v^T X$?

(c) Show that if $u$ and $v$ are two orthonormal vectors in the plane, then there is $\phi \in [0, 2\pi)$ such that $u^T = (\cos\phi, \sin\phi)$ and $v^T = (-\sin\phi, \cos\phi)$.


Exercise 14.13.3 Let $X = (X_1, X_2)^T$, with $X_1$, $X_2$ independent random variables.

(a) Consider $u$ and $w$ orthonormal vectors in $\mathbb{R}^2$. Show that $Y_1 = u^T X$ and $Y_2 = w^T X$ are also independent.

(b) Show that part (a) holds even if $u$ and $w$ are only orthogonal (the magnitudes of the vectors do not matter).


Exercise 14.13.4 Prove Lemma 14.5.2 




Exercise 14.13.5 (a) Find the Fisher metric of a sigmoid neuron with the activation function $\phi(x) = \sigma(x)$, where $\sigma(x)$ denotes the logistic function.

(b) Show the inequalities

0 < 900 < ^ 2 , 9ok < ^4 9ij < 

(c) State and prove a variant of the inequalities given in part (b) in the general case of an activation function $\phi$.

Exercise 14.13.6 Find the Fisher metric coefficients for a neuron with the input $X$ whose components $X_i$ are independent and identically distributed, $X_i \sim \mathrm{Unif}[0, 1]$.

Exercise 14.13.7 Find the Fisher metric coefficients for a one-hidden layer neural network with the activation function $\phi(x) = x$. Write the result in terms of the network parameters and the first two moments of the input variable, $X$.

Exercise 14.13.8 Find the Fisher metric coefficients for a one-hidden layer neural network with the activation function $\phi(x)$ in the case when the input is $X \sim \mathcal{N}(0, 1)$.


Exercise 14.13.9 Consider the loss function $L : \Theta \to \mathbb{R}$, a vector $v$ in the tangent space $T_\theta\Theta$ with $\|v\|_g^2 = 1$, and a learning step, $\eta > 0$. A small change of the parameters in the direction of $v$, of magnitude $\eta$, can be written as $d\theta = \eta v$, so that the linear approximation becomes

$$L(\theta + d\theta) = L(\theta) + \eta\, \nabla_{Eu} L(\theta)^T v.$$

We need to find the direction $v$ such that $L(\theta + d\theta)$ is minimized. For this, we consider the Lagrange functional

$$F(v, \lambda) = \nabla_{Eu} L(\theta)^T v - \lambda \|v\|_g^2.$$

(a) Show that the variational equations $\frac{\partial F}{\partial v_i} = 0$ imply $\nabla_{Eu} L(\theta) = 2\lambda\, g(\theta) v$.

(b) Show that $v = \nabla_g L(\theta)/\|\nabla_g L(\theta)\|_g$.


Exercise 14.13.10 Let $p_{X_1}(x_1; \theta)$ and $p_{X_2}(x_2; \theta)$ be the probability densities of random variables $X_1$ and $X_2$, respectively. Then

$$g(X_1, X_2; \theta) = g(X_1; \theta) + g(X_2 | X_1; \theta) = g(X_2; \theta) + g(X_1 | X_2; \theta),$$

where $g(X_1 | X_2; \theta)$ is the Fisher information defined by the conditional probability density $p_{X_1 | X_2}(x_1 | x_2; \theta)$ (i.e., the amount of information about $\theta$ contained in $X_1$, given $X_2$).





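Exercise 14.13.11 Let $X = \sum_k X^k \frac{\partial}{\partial \theta_k}$ be a vector field on $\Theta$. Show that:

(a) $\langle \nabla_{Eu} f, X \rangle_{Eu} = \sum_k X^k \frac{\partial f}{\partial \theta_k}$.

(c) $\langle \nabla_{Eu} f, X \rangle_{Eu} = g(\nabla_g f, X)$.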

Exercise 14.13.12 Show that: 

(a) $\|\nabla_g f\|_g^2 = (\nabla_{Eu} f)^T g^{-1}(\theta)\, \nabla_{Eu} f$;

(b) $\|\nabla_{Eu} f\|_g^2 = (\nabla_{Eu} f)^T g(\theta)\, \nabla_{Eu} f$;

(c) $\nabla_{Eu} f$ and $\nabla_g f$ vanish at the same points.

Exercise 14.13.13 Consider the 1-dimensional random variable $X \sim \mathcal{N}(\mu, 1)$ and $\hat\mu(x_1, \dots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$ an estimator for the mean $\mu$ by $n$ independent observations of the variable $X$.

(a) Show that $\hat\mu(x_1, \dots, x_n)$ is an unbiased estimator of the mean, $\mu$;

(b) Find the Fisher information of $X$;

(c) Show that $\hat\mu(x_1, \dots, x_n)$ is Fisher-efficient.

Exercise 14.13.14 Let $X \sim \mathrm{Pois}(\lambda)$ be a Poisson-distributed discrete random variable with parameter $\lambda$, i.e.,

$$P(X = k) = e^{-\lambda}\, \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \dots$$

Construct a Fisher-efficient estimator for the parameter $\lambda$.

Exercise 14.13.15 Consider the independent, identically distributed random variables, $X_1, \dots, X_N \sim X$, with $X \sim \mathcal{N}(\mu, 1)$, and consider their average

$$\bar X = \frac{1}{N}(X_1 + \cdots + X_N).$$

Show that the information contained in $\bar X$ about $\mu$ is the sum of the information of each individual variable about $\mu$, i.e., $I(\bar X) = N\, I(X)$, where $I$ denotes the Fisher information of a one-dimensional random variable.

Exercise 14.13.16 (a) Let $X_1$ and $X_2$ be two independent random variables with probability densities, $p_{X_1}(x_1; \theta)$, $p_{X_2}(x_2; \theta)$, which depend on the parameter $\theta$. The information about $\theta$ contained in $X_i$ is given by the Fisher information $g(X_i; \theta)$. Prove that the Fisher information contained in the pair $(X_1, X_2)$ is the sum of individual Fisher informations

$$g(X_1, X_2; \theta) = g(X_1; \theta) + g(X_2; \theta).$$





(b) State and prove a generalization to $n$ independent random variables.

(c) Show that the inverse of the Fisher information matrix contained in $N$ identically distributed independent random variables $X_1, \dots, X_N \sim X$ about $\theta$ is $\frac{1}{N}\, g^{-1}(X; \theta)$.

(d) Use part (c) to explain why the definition of an asymptotically efficient estimator $\hat\theta(N)$ of $\theta$ based on $N$ independent identically distributed random variables, $X_1, \dots, X_N$, is given by

$$\lim_{N\to\infty} N\, E\big[(\hat\theta(N) - \theta)(\hat\theta(N) - \theta)^T\big] = g^{-1}(\theta),$$

where $g(\theta)$ is the Fisher information matrix corresponding to one of the random variables.



Part V 
Other Architectures 





Chapter 15 

Pooling 


Pooling is a machine learning technique that provides a summary of the input, selecting some essential local features such as maxima, minima, averages, etc.

It also acts as an information contractor; in the discrete case it decreases the dimension of the input by a certain factor. Hence its usefulness in classification problems.

The idea of pooling is to consider a partition of the domain of a function and replace the function on each partition element by the "most representative" value of the function on that set. This procedure leads to a simple function. A two-dimensional variant of pooling is used in the construction of convolutional neural networks.


15.1 Approximation of Continuous Functions 


This section deals with the max-, min-, and average-pooling techniques applied in the context of a continuous function on a compact set. For the sake of simplicity, we shall prove the results just for the case of a one-dimensional compact interval, $[a, b]$, while the reader can easily extend the results to multiple dimensions.

Max-Pooling Let $f : [a, b] \to \mathbb{R}$ be a continuous function and consider the equidistant partition of the interval $[a, b]$

$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b.$$

The partition size, $\frac{b-a}{n}$, is called stride. Denote by $M_i = \max_{[x_{i-1}, x_i]} f(x)$ and consider the simple function $S_n(x) = \sum_{i=1}^{n} M_i\, 1_{[x_{i-1}, x_i)}(x)$. The process of approximating the function $f(x)$ by the simple function $S_n(x)$ is called max-pooling. More details can be found in Zhou and Chellappa [132].

Figure 15.1: When $n$ increases, the difference $M_i - m_i$ decreases towards zero.

© Springer Nature Switzerland AG 2020
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3_15

Min-Pooling As a variant, we may consider $m_i = \min_{[x_{i-1}, x_i]} f(x)$ and define the simple function $s_n(x) = \sum_{i=1}^{n} m_i\, 1_{[x_{i-1}, x_i)}(x)$. The min-pooling is the process of approximating the function $f(x)$ by the step function $s_n(x)$. We note that the following double inequality holds:

$$s_n(x) \le f(x) \le S_n(x), \qquad \forall n \ge 1.$$


Average-Pooling Consider the average of the function $f$ on the interval $[x_{i-1}, x_i]$, given by $\mu_i = \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x_i} f(u)\, du$. Pooling the average of the function on each interval, we obtain the function $A_n(x) = \sum_{i=1}^{n} \mu_i\, 1_{[x_{i-1}, x_i)}(x)$.

The next result states that all previous pooling functions are "good approximators" for $f$.


Theorem 15.1.1 Let $f : [a, b] \to \mathbb{R}$ be a continuous function. Then all three function sequences, $(S_n)_n$, $(s_n)_n$, and $(A_n)_n$, converge uniformly to $f$ on $[a, b]$, as $n \to \infty$. This means that $\forall \epsilon > 0$, there is $N \ge 1$ such that

$$|S_n(x) - f(x)| < \epsilon, \qquad |s_n(x) - f(x)| < \epsilon, \qquad |A_n(x) - f(x)| < \epsilon,$$

$\forall x \in [a, b]$, $\forall n \ge N$.







































Proof: Construct the sequence

$$u_n(x) = S_n(x) - s_n(x) = \sum_{i=1}^{n} (M_i - m_i)\, 1_{[x_{i-1}, x_i)}(x),$$

which satisfies the following properties, see Fig. 15.1:

(i) $u_n(x) \ge 0$;

(ii) $u_{n+1}(x) \le u_n(x)$, for any $n \ge 1$;

(iii) $u_n(x) \to 0$, as $n \to \infty$, for any fixed $x$.


Step 1. We show that $(u_n)_n$ converges to 0, uniformly on $[a, b]$.

Let $\epsilon > 0$ be arbitrary fixed. From the uniform continuity of $f$ on $[a, b]$, there is $N \ge 1$ such that if $|x - x'| < \frac{b-a}{N}$, then $|f(x) - f(x')| < \epsilon$. Now, in each interval of the partition there are values $\xi_i, \eta_i \in [x_{i-1}, x_i]$ such that $M_i = f(\xi_i)$ and $m_i = f(\eta_i)$. Since $|\xi_i - \eta_i| \le \frac{b-a}{n} < \frac{b-a}{N}$ for $n > N$, then $M_i - m_i = |f(\xi_i) - f(\eta_i)| < \epsilon$. This implies

$$u_n(x) = \sum_{i=1}^{n} (M_i - m_i)\, 1_{[x_{i-1}, x_i)}(x) < \epsilon, \qquad \forall x \in [a, b].$$

This means that $|u_n(x)| < \epsilon$, $\forall x \in [a, b]$, and $\forall n > N$, i.e., $(u_n)_n$ converges uniformly to 0.

Step 2. We show that $(S_n)_n$ converges to $f$, uniformly on $[a, b]$.

Since $s_n \le f$, the following inequality holds:

$$S_n - f = (S_n - s_n) + (s_n - f) \le S_n - s_n = u_n,$$

for any $n \ge 1$. Let $\epsilon > 0$ be arbitrary fixed. Using Step 1 together with the previous inequality, we have

$$|S_n(x) - f(x)| \le |u_n(x)| < \epsilon, \qquad \forall n > N,\ \forall x \in [a, b].$$

This means that $(S_n)_n$ converges uniformly to $f$ as $n \to \infty$.

Step 3. We show that $(s_n)_n$ converges to $f$, uniformly on $[a, b]$.

This is similar to Step 2. Since $S_n \ge f$, then

$$f - s_n = (f - S_n) + (S_n - s_n) \le S_n - s_n = u_n,$$

for any $n \ge 1$. Let $\epsilon > 0$ be arbitrary fixed. Using Step 1 and the previous inequality, we have

$$|f(x) - s_n(x)| \le |u_n(x)| < \epsilon, \qquad \forall n > N,\ \forall x \in [a, b].$$

This means that $(s_n)_n$ converges uniformly to $f$ as $n \to \infty$.

Step 4. We show that $(A_n)_n$ converges to $f$, uniformly on $[a, b]$.

Let $\epsilon > 0$ be arbitrary fixed. By the integral version of the Mean Value Theorem, there is an $x_i^* \in [x_{i-1}, x_i]$ such that $\mu_i = f(x_i^*)$. Therefore,

$$m_i \le \mu_i \le M_i.$$

Multiplying by the indicator function $1_{[x_{i-1}, x_i)}(x)$ and summing over $i$ yields

$$s_n(x) \le A_n(x) \le S_n(x).$$

This implies $|A_n(x) - s_n(x)| \le u_n(x)$ and $|S_n(x) - A_n(x)| \le u_n(x)$. Using Step 1 it follows that $|A_n(x) - s_n(x)| \to 0$ and $|S_n(x) - A_n(x)| \to 0$ uniformly, as $n \to \infty$. Now, the triangle inequality provides

$$|A_n(x) - f(x)| = |A_n(x) - S_n(x) + S_n(x) - f(x)| \le |S_n(x) - A_n(x)| + |S_n(x) - f(x)| < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon,$$

where we used Step 2.
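The uniform convergence in Theorem 15.1.1 can be observed numerically. The sketch below assumes a hypothetical test function $f(x) = \sin(3x) + 0.5x$ on $[0, 2]$ and estimates the sup-norm errors of the three pooled step functions on a finite sample of each cell; the sample mean stands in for the integral average $\mu_i$, so the reported numbers are approximations of the true errors.

```python
import numpy as np

def pooled_sup_errors(f, a, b, n, samples_per_cell=200):
    """Approximate sup-norm errors of max-, min- and average-pooling of f
    on n equal cells of [a, b], estimated on a uniform sample of each cell."""
    edges = np.linspace(a, b, n + 1)
    # one row of sample points per cell [x_{i-1}, x_i]
    unit = np.linspace(0.0, 1.0, samples_per_cell)
    x = edges[:-1, None] + (edges[1:, None] - edges[:-1, None]) * unit
    fx = f(x)
    M = fx.max(axis=1, keepdims=True)     # pooled maxima M_i
    m = fx.min(axis=1, keepdims=True)     # pooled minima m_i
    mu = fx.mean(axis=1, keepdims=True)   # sample mean as a stand-in for mu_i
    return (np.abs(M - fx).max(), np.abs(m - fx).max(), np.abs(mu - fx).max())

f = lambda x: np.sin(3 * x) + 0.5 * x     # hypothetical continuous test function
errors = {n: pooled_sup_errors(f, 0.0, 2.0, n) for n in (4, 16, 64, 256)}
for n, e in errors.items():
    print(n, e)
```

All three error columns shrink as the number of cells grows, at a rate governed by the oscillation $M_i - m_i$ of $f$ on each cell.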


Remark 15.1.2 The pooling can be extended to the multidimensional case, where $f : K \to \mathbb{R}$ is a continuous function defined on a compact set $K \subset \mathbb{R}^n$. Consider the covering $K = \bigcup_{i=1}^{n} \bar{A}_i$ of the compact $K$, where $A_i \cap A_j = \emptyset$ for $i \ne j$, with $A_i$ disjoint open sets, and $\bar{A}_i$ denotes the closure of $A_i$.¹ We pool the maxima $M_i = \max_{\bar{A}_i} f(x)$ and consider the approximation $S_n(x) = \sum_{i=1}^{n} M_i\, 1_{A_i}(x)$. If

$$\max_{1 \le i \le n}\ \sup_{x, y \in A_i} \|x - y\| \to 0$$

as $n \to \infty$, then a proof similar to the previous one shows that $S_n$ converges uniformly to $f$ on $K$.


15.2 Translation Invariance 

In this section we shall prove the property of local translation invariance for the max- and min-pooling. Consider the notation $T_a$ for the translation operator defined by $(T_a \circ f)(x) = f(x - a)$, for any function $f$ of a real variable and $a \in \mathbb{R}$. We also denote by $\mathcal{P}(f)$ the min- or max-pooling function of $f$ associated with a given partition.


¹ The closure of an open set $A$ is the set $A$ together with its boundary.













Figure 15.2: Max-pooling for $f$ and $T_a \circ f$.


Proposition 15.2.1 Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function. There is a partition of $\mathbb{R}$ such that

$$\mathcal{P}(T_a \circ f) = \mathcal{P}(f),$$

for any small enough value of $a$.


Proof: We shall perform the proof in the case of max-pooling. The min-pooling can be treated similarly. The proof idea is that under small translations the maxima do not leave the partition intervals, see Fig. 15.2.

Choose a finite partition $[x_i, x_{i+1})$, $0 \le i \le N - 1$, such that the maxima, $\xi_i$, of the restrictions $f|_{[x_i, x_{i+1})}$ are inside the open intervals $(x_i, x_{i+1})$. There is an $\eta > 0$ such that $x_i + \eta < \xi_i < x_{i+1} - \eta$. Then choose $a \in \mathbb{R}$ such that $|a| < \eta$.

Since the graph of $T_a \circ f$ is obtained by shifting horizontally the graph of $f$ by an amount $a$, the maxima do not leave the intervals and we have

$$M_i(f) = \max_{[x_i, x_{i+1}]} f(x) = \max_{[x_i, x_{i+1}]} f(x - a) = \max_{[x_i, x_{i+1}]} (T_a \circ f)(x) = M_i(T_a \circ f).$$

Therefore, the functions $f$ and $T_a \circ f$ will have the same max-pooling functions.


Remark 15.2.1 (i) The invariance property extends to several dimensions 
with only minor alterations in the proof. 

(ii) The previous property provides stability of the pooling under small input 
variations. 













Figure 15.3: Max-pooling layer with $Y_j = \max\{X_{1j}, \dots, X_{pj}\}$, $1 \le j \le N$.

15.3 Information Approach 

Another way to look at pooling is by investigating its effect on information 
content. This section deals with the case of max-pooling, but similar results 
hold also for the average-pooling. 

Consider $n$ random variables, $X_1, X_2, \dots, X_n$, and let $Y = \max\{X_1, \dots, X_n\}$. Let $\mathfrak{S}(X_i)$ be the sigma-algebra generated by $X_i$, and

$$\mathfrak{S}(X) = \mathfrak{S}(X_1, \dots, X_n) = \mathfrak{S}(X_1) \vee \cdots \vee \mathfrak{S}(X_n)$$

be the information generated by all $X_i$. For any $b \in \mathbb{R}$ we have

$$Y^{-1}(-\infty, b] = \{\omega; Y(\omega) \le b\} = \{\omega; X_i(\omega) \le b, \forall i = 1, \dots, n\} = \bigcap_{i=1}^n \{\omega; X_i(\omega) \le b\} = \bigcap_{i=1}^n X_i^{-1}(-\infty, b] \in \bigcap_{i=1}^n \mathfrak{S}(X_i).$$

Consequently, $\mathfrak{S}(Y) \subset \bigcap_{i=1}^n \mathfrak{S}(X_i)$, that is, the information of the maximum of $n$ random variables is included in the information generated by each variable. Next, we shall apply this result to neural networks.

Definition 15.3.1 We say that the $\ell$th layer of a feedforward neural network is a pooling layer if

(i) the $(\ell - 1)$th layer is divided into a partition of $N$ classes of neurons;

(ii) all neurons of the $(\ell - 1)$th layer that belong to the same class are mapped into the same neuron in the $\ell$th layer, whose activation is their corresponding maximum value;

(iii) the number of neurons in the $\ell$th layer is $d^{(\ell)} = N$, where $N$ is the number of partition classes.


Roughly stated, a pooling layer replaces each partition class of a layer by the maximum neuron value in that class. In Fig. 15.3, the $(\ell - 1)$th layer contains neurons with values $X_{ij}$, $1 \le i \le p$, $1 \le j \le N$, divided into $N$ classes

$$\{X_{11}, \dots, X_{p1}\}, \{X_{12}, \dots, X_{p2}\}, \dots, \{X_{1N}, \dots, X_{pN}\},$$

each class having $p$ neurons. Each class is pooled into its maximum value

$$Y_j = \max\{X_{1j}, \dots, X_{pj}\}, \quad 1 \le j \le N.$$


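From the previous computation, the information in each of the neurons of the pooling layer satisfies the inclusion

$$\mathfrak{S}(Y_j) \subset \bigcap_{i=1}^p \mathfrak{S}(X_{ij}). \qquad (15.3.1)$$

The information generated by the pooling layer is given by

$$\mathfrak{S}(Y) = \mathfrak{S}(Y_1, \dots, Y_N) = \mathfrak{S}\Big[\bigcup_{j=1}^N \mathfrak{S}(Y_j)\Big].$$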



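Inclusion (15.3.1) implies

$$\mathfrak{S}(Y) \subset \mathfrak{S}\Big[\bigcup_{j=1}^N \bigcap_{i=1}^p \mathfrak{S}(X_{ij})\Big]. \qquad (15.3.2)$$

Using formula (b') of Appendix section A yields

$$\bigcup_{j=1}^N \bigcap_{i=1}^p \mathfrak{S}(X_{ij}) = \bigcup_{j=1}^N \bigcap_{i_j=1}^p \mathfrak{S}(X_{i_j j}) = \bigcap_{i_1, \dots, i_N} \Big(\bigcup_{j=1}^N \mathfrak{S}(X_{i_j j})\Big).$$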

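Then (15.3.2), with the help of Exercise 15.6.1 part (a), becomes

$$\mathfrak{S}(Y) \subset \mathfrak{S}\Big[\bigcap_{i_1, \dots, i_N} \Big(\bigcup_{j=1}^N \mathfrak{S}(X_{i_j j})\Big)\Big] \subset \bigcap_{i_1, \dots, i_N} \mathfrak{S}\Big[\bigcup_{j=1}^N \mathfrak{S}(X_{i_j j})\Big],$$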













Figure 15.4: Two max-pooling layers applied to an MNIST image. 


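which can also be written as

$$\mathfrak{S}(Y) \subset \bigcap_{i_1, \dots, i_N} \bigvee_{j=1}^N \mathfrak{S}(X_{i_j j}) = \bigcap_{i_1, \dots, i_N} \mathfrak{S}(X_{i_1 1}, \dots, X_{i_N N}).$$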

This relation has the following interpretation. From the first class we pick an arbitrary neuron, say $X_{i_1 1}$. The information generated by this neuron is $\mathfrak{S}(X_{i_1 1})$. If this is done for each class, then the information generated by these arbitrarily class-picked neurons is $\mathfrak{S}(X_{i_1 1}, \dots, X_{i_N N})$. The previous inclusion states that the information of the pooling layer, $\mathfrak{S}(Y)$, is contained in any of the information sets $\mathfrak{S}(X_{i_1 1}, \dots, X_{i_N N})$, regardless of the arbitrary choice of neurons.

15.4 Pooling and Classification 

Pooling is customarily used when the dimension of the input needs to be lowered to match the number of classification classes. For instance, in the case of the MNIST data, each input image has 28 x 28 = 784 pixels, while there are 10 classification classes (the numbers 0, 1, ..., 9). Each image can be partitioned, for instance, into 14 x 14 squares, each having 2 x 2 pixels. From each 2 x 2 square we retain only the pixel of maximum intensity. As a consequence, we obtain a 14 x 14 pixel image, which is the result of the first pooling. The second pooling divides the 14 x 14 image into 7 x 7 squares, each of 2 x 2 pixels; again, we retain from each of these only the pixel of maximum intensity, see Fig. 15.4. This is a process by which information is thrown away in an irreversible way.
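The two pooling steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the image is a random integer array standing in for an MNIST digit, and `max_pool_2x2` is a hypothetical helper, not a routine taken from any particular library.

```python
import numpy as np

def max_pool_2x2(image):
    """Keep the maximum of each non-overlapping 2 x 2 block (even side lengths assumed)."""
    rows, cols = image.shape
    # blocks[a, b, c, d] is the pixel image[2a + b, 2c + d]
    blocks = image.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))   # stand-in for a 28 x 28 MNIST image

first = max_pool_2x2(image)     # 28 x 28 -> 14 x 14
second = max_pool_2x2(first)    # 14 x 14 -> 7 x 7
print(image.shape, first.shape, second.shape)
```

Each entry of `second` is the maximum over a 4 x 4 block of the original image, so the two consecutive poolings act as a single pooling with larger classes.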

Pooling is usually used as a companion to convolution. The convolution layer filters the input signal, removing noise, while the pooling layer selects a rough summary of the filtered signal features, see Fig. 15.5. The convolution operation and convolutional networks will be discussed in more detail in the next chapter.

















Figure 15.5: The effects of convolution and pooling: a. Raw and noisy input 
signal; b. Smoothed signal obtained by convolution; c. Pooled signal obtained 
by selecting maxima and minima. 

15.5 Summary 

Pooling is a machine learning technique by which the dimension of the input 
is decreased by a certain factor. It can be also considered as an information 
contractor by which the information is thrown away in an irreversible way. 
Pooling is used in classification problems together with convolution. 

15.6 Exercises 

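Exercise 15.6.1 Let $(C_i)$ be a collection of measurable sets. Show that

(a) $\mathfrak{S}\big(\bigcap_i C_i\big) \subset \bigcap_i \mathfrak{S}(C_i)$;

(b) $\mathfrak{S}\big(\bigcup_i C_i\big) \supset \bigcup_i \mathfrak{S}(C_i)$.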


Exercise 15.6.2 There are $N = 2^n$ participants in a chess competition. The participants compete in pairs. At each round, the winner of each pair competes against the winner of another pair. The final winner is obtained after the $n$th round. Explain this procedure in the light of a max-pooling process.

Exercise 15.6.3 (a) Prove Proposition 15.2.1 in the case of min-pooling. 

(b) Formulate and prove a version of Proposition 15.2.1 for two-dimensional functions.

In a neural network it is not recommended to place pooling layers consecutively, since their composition can be written as a single pooling layer. The next exercise deals with a more precise statement:

Exercise 15.6.4 (a) Assume all layers of a neural network are max-pooling layers. Show that the final output of the net is the largest input.

(b) Assume all layers of a neural network are average-pooling layers. Show that the final output of the net is the average of the input.

(c) If all layers of a neural network are min-pooling layers, show that the final output of the net is the min of the input.

Exercise 15.6.5 In a neural network a max-pooling layer is followed by a min-pooling layer.

(a) Show that the network output changes if we switch the order of these pooling layers.

(b) Does the result still hold if the min-pooling layer is replaced by an average-pooling layer?





Chapter 16 

Convolutional Networks 


Convolutional neural networks (CNNs) are feedforward neural networks with shared weights and sparse interactions, that is, most weights are equal to zero. Given their smaller number of parameters, convolutional networks are more efficient to train than similarly sized fully-connected networks, with only minor negative consequences on their performance.

The excellent performance of CNNs in image processing is due to their adaptability to a 2-D grid-like topology. The CNN is a biologically inspired piece of AI, whose design is based on neuroscientific principles, and it has been successfully applied to pattern recognition, see LeCun et al. [74].

Like almost any other feedforward neural network, CNNs are trained with a version of the backpropagation algorithm and typically use ReLU as activation function. We shall discuss in this chapter the architecture of CNNs based on the concepts of local receptive field, kernel, convolution, and feature map.


16.1 Discrete One-dimensional Signals 

A discrete one-dimensional signal can be described by the doubly infinite sequence of real numbers

$$y = [\dots, y_{-2}, y_{-1}, y_0, y_1, y_2, \dots],$$

where $y_k$ represents the signal amplitude measured at time $t_k$ (Fig. 16.1). A signal $y$ is called:

• finite signal if $\max_k |y_k| < \infty$, i.e., $\|y\|_\infty < \infty$;

• $L^1$-finite signal if $\sum_{k=-\infty}^{\infty} |y_k| < \infty$, i.e., $\|y\|_1 < \infty$;

• finite energy signal if $\sum_{k=-\infty}^{\infty} |y_k|^2 < \infty$, i.e., $\|y\|_2 < \infty$;


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_16 




Figure 16.1: Discrete one-dimensional signal.


• compact support signal if there is $N \ge 1$ such that $y_k = 0$ for any $|k| > N$.

It is worth noting that a finite energy signal with compact support is $L^1$-finite. This follows from Cauchy's inequality

$$\Big(\sum_{k=-N}^{N} |y_k|\Big)^2 \le (2N + 1) \sum_{k=-N}^{N} |y_k|^2.$$


A signal can be processed by filtering. This involves a convolution operation between the signal and a kernel or filter, as we shall define next. A kernel (filter) is a compact support sequence of weights

$$w = [\dots, w_{-2}, w_{-1}, w_0, w_1, w_2, \dots].$$

The convolved signal between $y$ and $w$ is the signal

$$z = [\dots, z_{-2}, z_{-1}, z_0, z_1, z_2, \dots],$$

denoted by $z = y * w$, and defined by

$$z_j = \sum_{k=-\infty}^{\infty} y_{j+k} w_k. \qquad (16.1.1)$$

The above infinite sum makes sense since $w$ has only a finite number of nonzero elements. Each component of the convolved signal $z$ is a weighted sum of the components of the initial signal $y$. The effect of convolution is to average out a signal using some given system of weights. Equivalently, sliding the filter $w$, then multiplying by $y$ and summing, produces the filtered signal $z$.

It is worth noting that formula (16.1.1) is known in signal processing as cross-correlation.
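Formula (16.1.1) can be checked numerically on a finite signal. The sketch below is only illustrative: the helper `cross_correlate`, the signal and the kernel values are hypothetical, the kernel is assumed to have support $k = 0, 1$, and only those outputs $z_j$ are computed for which the kernel fits inside the signal.

```python
import numpy as np

def cross_correlate(y, w):
    """Return z_j = sum_k y[j + k] * w[k] for every j where the kernel fits inside y."""
    n, m = len(y), len(w)
    return np.array([np.dot(y[j:j + m], w) for j in range(n - m + 1)])

y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.25, 0.75])        # hypothetical kernel with w_0 = 1/4, w_1 = 3/4

z = cross_correlate(y, w)
print(z)                                        # [1.75 2.75 3.75]
print(np.correlate(y, w, mode="valid"))         # NumPy's cross-correlation gives the same values
print(np.convolve(y, w[::-1], mode="valid"))    # so does convolution with the flipped kernel
```

The last line shows the relation with the convolution used in signal processing: the two operations differ only by a reversal of the kernel.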


Example 16.1.1 (Moving average signal) If the kernel is given by

$$w = [\dots, 0, 0, w_0 = \tfrac{1}{2}, w_1 = \tfrac{1}{2}, 0, 0, \dots],$$

then $z = y * w$ is a moving average signal. In this case each term of the sequence is replaced by the arithmetic average of two consecutive terms, i.e.,

$$z_j = \frac{1}{2}(y_j + y_{j+1}).$$


The shifting details can be inferred from the next table, in which each row lists the weights applied to the entries of the signal:

           y_{-2}  y_{-1}   y_0    y_1    y_2    y_3    y_4    z_j
  j = 0      0       0      1/2    1/2     0      0      0     z_0 = (y_0 + y_1)/2
  j = 1      0       0       0     1/2    1/2     0      0     z_1 = (y_1 + y_2)/2
  j = 2      0       0       0      0     1/2    1/2     0     z_2 = (y_2 + y_3)/2


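Similarly, if the filter is given by $w = [\dots, 0, \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}, 0, \dots]$, with $w_{-1} = w_0 = w_1 = \tfrac{1}{3}$, we obtain a moving average signal as an average of three consecutive terms

$$z_j = \frac{1}{3}(y_{j-1} + y_j + y_{j+1}).$$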


16.2 Continuous One-dimensional Signals 


In this case the signal amplitude is given as a continuous function of time, $y = y(t)$. The previous definitions can be adapted for the continuous case as:

• finite signal if $\|y\|_\infty < \infty$;

• $L^1$-finite signal if $\|y\|_1 = \int_{\mathbb{R}} |y(t)| \, dt < \infty$;

• finite energy signal if $\|y\|_2 = \big(\int_{\mathbb{R}} |y(t)|^2 \, dt\big)^{1/2} < \infty$;

• compact support signal if there is $u > 0$ such that $y(t) = 0$ for any $|t| > u$.

The filter in this case is a continuous function with compact support, $w = w(t)$. This means $w(t) = 0$ for $|t|$ large enough. The convolved signal, $z = y * w$, is defined by

$$z(t) = (y * w)(t) = \int_{\mathbb{R}} y(u + t) w(u) \, du = \int_{\mathbb{R}} y(v) w(v - t) \, dv.$$

This formula is the continuous version of the cross-correlation between the continuous signals $y$ and $w$. It is worth noting that convolution in the mathematical literature is defined with a flipped sign compared with the previous formula. However, this is the form in which the concept becomes useful in neural networks.

Similarly to the discrete case, the statement that any finite energy signal with compact support is $L^1$-finite also holds in the continuous case. Furthermore, any filtered $L^1$-finite signal is also $L^1$-finite. This follows from the estimation

$$\begin{aligned}
\|z\|_1 &= \int_{\mathbb{R}} |z(t)| \, dt = \int_{\mathbb{R}} \Big| \int_{\mathbb{R}} w(u) y(u + t) \, du \Big| \, dt \\
&\le \int_{\mathbb{R}} \int_{\mathbb{R}} |w(u) y(u + t)| \, du \, dt = \int_{\mathbb{R}} |w(u)| \int_{\mathbb{R}} |y(u + t)| \, dt \, du \\
&= \int_{\mathbb{R}} |w(u)| \int_{\mathbb{R}} |y(\tau)| \, d\tau \, du = \int_{\mathbb{R}} |w(u)| \, \|y\|_1 \, du = \|y\|_1 \|w\|_1,
\end{aligned}$$

where we inverted the order of integration using Fubini's theorem.


16.3 Discrete Two-dimensional Signals 


A discrete two-dimensional signal is an infinite matrix $y = [y_{ij}]_{i,j}$, where $y_{ij}$ is the activation of the $(i, j)$-pixel. This way, any black and white picture can be considered as a two-dimensional signal. If the $(i, j)$-pixel is black then the activation is $y_{ij} = 1$; if the pixel is white, the activation is $y_{ij} = 0$; any other gray tone is a number between 0 and 1.

A signal $y$ is called:

• $L^1$-finite if $\sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} |y_{jk}| < \infty$;

• finite energy signal if $\sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} |y_{jk}|^2 < \infty$;

• compact support signal if there is $N \ge 1$ such that $y_{jk} = 0$ whenever $|j| > N$ or $|k| > N$.

A kernel in this case is a compact support signal $w = [w_{ij}]$. The signal $y = [y_{jk}]$ can be convolved with the kernel $w = [w_{ij}]$ into the signal $z = y * w$, as

$$z_{ij} = \sum_{k=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} y_{i+k, j+r} \, w_{kr}.$$

In the two-dimensional case the output $z_{ij}$ is also called a feature map, since it is supposed to contain some image features that are characteristic to the kernel $w$.


Remark 16.3.1 For a continuous two-dimensional signal, $y : \mathbb{R}^2 \to \mathbb{R}$, its convolution with a continuous filter $w : \mathbb{R}^2 \to \mathbb{R}$ is obtained by the integral formula

$$z(t_1, t_2) = \iint_{\mathbb{R}^2} y(u_1 + t_1, u_2 + t_2) \, w(u_1, u_2) \, du_1 du_2 = \iint_{\mathbb{R}^2} y(v_1, v_2) \, w(v_1 - t_1, v_2 - t_2) \, dv_1 dv_2.$$

The reader should have no difficulty extending the definitions of $L^1$-finite, finite energy, and compact support signals from the discrete case to the continuous case.





Example 16.3.2 (Moving average in 2d) Consider the kernel with a $2 \times 2$ support

$$w = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/4 & 1/4 & 0 \\ 0 & 1/4 & 1/4 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$

The convolved signal is obtained by averaging out four neighboring activations

$$z_{ij} = \frac{1}{4}\big(y_{i,j} + y_{i,j+1} + y_{i+1,j} + y_{i+1,j+1}\big).$$

In the following we shall denote the convolution operator by $C$, so $C(y)$ is the output of a convolution layer with the input $y$. In the previous notations we have $z = C(y)$. We shall also denote by $T_{a,b}$ the translation operator in the direction of the vector $(a, b)$, defined by $(T_{a,b} \circ y)_{ij} = y_{i-a, j-b}$. We note that the $L^1$-finiteness, finite energy and compact support properties transfer from a signal to its translation.

Proposition 16.3.3 (Equivariance to translation) The convolution operation preserves translations, that is

$$C(T_{a,b} \circ y) = T_{a,b} \circ C(y).$$

Proof: First we note that $(T_{a,b} \circ y)_{ij} = y_{i-a, j-b}$. Then for any fixed indices $i$ and $j$, we have

$$\begin{aligned}
C(T_{a,b} \circ y)_{ij} &= \big((T_{a,b} \circ y) * w\big)_{ij} = \sum_k \sum_r (T_{a,b} \circ y)_{i+k, j+r} \, w_{kr} \\
&= \sum_k \sum_r y_{i+k-a, j+r-b} \, w_{kr} = (y * w)_{i-a, j-b} \\
&= \big(T_{a,b}(y * w)\big)_{ij} = \big(T_{a,b} \circ C(y)\big)_{ij}.
\end{aligned}$$


The previous result states that if the input is affected by a translation, then after the convolution, the output is also affected by the same translation. Hence, since many input image features, such as corners, edges, etc., are invariant by translation, they will still be present in the output of the convolution layer.
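The equivariance property can also be verified numerically. The following is a sketch under a simplifying assumption that is not part of the text: the image is treated as periodic, so that translations and index shifts wrap around, which replaces the infinite compactly supported signals by a finite array. The helper names and the data are hypothetical.

```python
import numpy as np

def circular_cross_correlate(y, w):
    """z[i, j] = sum_{k, r} y[i + k, j + r] * w[k, r], with indices taken modulo the image size."""
    z = np.zeros_like(y)
    for k in range(w.shape[0]):
        for r in range(w.shape[1]):
            # np.roll(y, (-k, -r), axis=(0, 1))[i, j] equals y[i + k, j + r] (mod size)
            z += w[k, r] * np.roll(y, shift=(-k, -r), axis=(0, 1))
    return z

def translate(y, a, b):
    """(T_{a,b} y)[i, j] = y[i - a, j - b], again with periodic wrap-around."""
    return np.roll(y, shift=(a, b), axis=(0, 1))

rng = np.random.default_rng(1)
y = rng.integers(0, 10, size=(8, 8))      # hypothetical integer-valued image
w = np.array([[1, -1], [2, 1]])           # hypothetical 2 x 2 kernel

lhs = circular_cross_correlate(translate(y, 2, 3), w)   # C(T_{a,b} y)
rhs = translate(circular_cross_correlate(y, w), 2, 3)   # T_{a,b} C(y)
print(np.array_equal(lhs, rhs))                         # True
```

Integer data are used so that the two sides can be compared exactly rather than up to rounding.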


Remark 16.3.4 Pooling is invariant to small translations of data, see section 15.2, that is, $\mathcal{P}(T_{a,b} \circ y) = \mathcal{P}(y)$. This property is compatible with the translation equivariance of convolution, which is why pooling and convolution are applied together. That is, if pooling is applied after convolution, we have

$$\mathcal{P} \circ C(T_{a,b} \circ y) = \mathcal{P} \circ C(y).$$

Conversely, if convolution is applied after pooling, then

$$C \circ \mathcal{P}(T_{a,b} \circ y) = C \circ \mathcal{P}(y).$$


16.4 Convolutional Layers with 1-D Input 


A convolution layer resembles a fully-connected layer, the difference being that it has lots of zero weights and repeating nonzero weights. Consider a neural network with the input given by the compact supported signal $x = [x_1, x_2, \dots, x_n]$ and let the sliding kernel be $w = [w_1, w_2]$. The lag by which the kernel slides with respect to the signal is called stride. In Fig. 16.2 a and b the strides are $s = 1$ and $s = 2$, respectively. In both cases, the neurons in the second layer have a sigmoid activation function.

In the case of Fig. 16.2 a the network output can be written in the familiar form, $Y = \sigma(WX + B)$, where $X = (x_1, \dots, x_6)^T$, $B = (b, b, b, b, b)^T$, $Y = (y_1, \dots, y_5)^T$. The system of weights can be written as the following $5 \times 6$ sparse matrix

$$W = \begin{pmatrix}
w_1 & w_2 & 0 & 0 & 0 & 0 \\
0 & w_1 & w_2 & 0 & 0 & 0 \\
0 & 0 & w_1 & w_2 & 0 & 0 \\
0 & 0 & 0 & w_1 & w_2 & 0 \\
0 & 0 & 0 & 0 & w_1 & w_2
\end{pmatrix}.$$

We note the repeating weights on each row of the matrix.

Similarly, in the case of Fig. 16.2 b the network output can be written as $Y = \sigma(WX + B)$, where $X = (x_1, \dots, x_6)^T$, $B = (b, b, b)^T$, $Y = (y_1, y_2, y_3)^T$, and the system of weights is given by the $3 \times 6$ matrix

$$W = \begin{pmatrix}
w_1 & w_2 & 0 & 0 & 0 & 0 \\
0 & 0 & w_1 & w_2 & 0 & 0 \\
0 & 0 & 0 & 0 & w_1 & w_2
\end{pmatrix}.$$
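The two sparse matrices can be generated for any stride. The sketch below is an illustration only; the helper `conv_weight_matrix` and the numerical values of $w_1$, $w_2$, $b$ and of the input are hypothetical choices, not values taken from the text.

```python
import numpy as np

def conv_weight_matrix(kernel, n_inputs, stride):
    """Sparse matrix W whose rows hold the shared kernel, shifted by `stride` per row."""
    width = len(kernel)
    n_outputs = (n_inputs - width) // stride + 1
    W = np.zeros((n_outputs, n_inputs))
    for row in range(n_outputs):
        start = row * stride
        W[row, start:start + width] = kernel
    return W

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w1, w2, b = 0.5, -1.0, 0.1        # hypothetical shared weights and shared bias
x = np.arange(1.0, 7.0)           # X = (x_1, ..., x_6)^T

W_stride1 = conv_weight_matrix([w1, w2], n_inputs=6, stride=1)   # 5 x 6 matrix
W_stride2 = conv_weight_matrix([w1, w2], n_inputs=6, stride=2)   # 3 x 6 matrix

Y1 = sigmoid(W_stride1 @ x + b)   # five outputs, as in Fig. 16.2 a
Y2 = sigmoid(W_stride2 @ x + b)   # three outputs, as in Fig. 16.2 b
print(W_stride1.shape, W_stride2.shape)
```

Storing the full matrix is wasteful in practice; it is built here only to make the shared-weight structure visible.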


Convolution with linear neurons Assume now that all neurons of a convolutional neural net have a linear activation, $\phi(x) = x$. We shall show that this neural network, having several convolution layers, is equivalent to a network with only one convolution layer. Therefore, using a nonlinear activation function is essential for deep learning in convolution networks.

It suffices to show that two convolution layers are equivalent to one convolution layer. Consider the forward propagations in two consecutive layers

$$X^{(1)} = W^{(1)} X^{(0)} + B^{(1)}, \qquad X^{(2)} = W^{(2)} X^{(1)} + B^{(2)},$$

with the weight matrices of sparse type. The composition provides

$$X^{(2)} = W^{(2)} \big(W^{(1)} X^{(0)} + B^{(1)}\big) + B^{(2)} = W X^{(0)} + B,$$

with weight matrix $W = W^{(2)} W^{(1)}$ and bias vector $B = W^{(2)} B^{(1)} + B^{(2)}$.

Figure 16.2: Two convolution layers: a stride equal to 1; b stride equal to 2, with outputs $y_1 = \sigma(w_1 x_1 + w_2 x_2 + b)$, $y_2 = \sigma(w_1 x_3 + w_2 x_4 + b)$, $y_3 = \sigma(w_1 x_5 + w_2 x_6 + b)$. All missing arrows are assigned a zero weight. All neurons have the same bias $b$.

We need to show that the matrix $W$ is also of sparse type. We shall discuss this on the previous two cases represented in Fig. 16.2 a, b.
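For the stride 1 case the product $W^{(2)} W^{(1)}$ can be computed explicitly. The sketch below assumes two hypothetical kernels, $w$ for the first layer and $u$ for the second (the text uses shared weights but does not fix their values), and checks that the product is again a convolution matrix, now with support width 3.

```python
import numpy as np

def conv_weight_matrix(kernel, n_inputs):
    """Stride-1 convolution matrix: each row is the kernel shifted one step to the right."""
    width = len(kernel)
    W = np.zeros((n_inputs - width + 1, n_inputs))
    for row in range(W.shape[0]):
        W[row, row:row + width] = kernel
    return W

w = np.array([2.0, 3.0])     # hypothetical kernel of the first layer
u = np.array([5.0, 7.0])     # hypothetical kernel of the second layer

W1 = conv_weight_matrix(w, 6)    # 5 x 6
W2 = conv_weight_matrix(u, 5)    # 4 x 5
W = W2 @ W1                      # 4 x 6 weight matrix of the composed linear layer

# Shared weights of the equivalent single layer (support width 3).
v = np.array([u[0] * w[0], u[0] * w[1] + u[1] * w[0], u[1] * w[1]])
print(v)                                             # [10. 29. 21.]
print(np.array_equal(W, conv_weight_matrix(v, 6)))   # True
```

The three weights `v` coincide with `np.convolve(u, w)`, so composing linear convolution layers amounts to convolving their kernels.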

The convolution layer represented in Fig. 16.2 a has stride $s = 1$ and uses only two shared weights, $w_1, w_2$ (the support width is 2). We consider two layers of this type as in Fig. 16.3 a. Then $Y_1$ depends on $X_1, X_2, X_3$; $Y_2$ depends on $X_2, X_3, X_4$, and so forth. The convolutional network is equivalent to the two-layer net given in Fig. 16.3 b. The new net depends on three shared weights, $v_1, v_2, v_3$ (the support width is 3), with stride 1. Both networks in Fig. 16.3 a, b satisfy the information relations

$$\mathfrak{S}(Y_1) \subset \mathfrak{S}(X_1, X_2, X_3), \qquad \mathfrak{S}(Y_2) \subset \mathfrak{S}(X_2, X_3, X_4),$$

$$\mathfrak{S}(Y_3) \subset \mathfrak{S}(X_3, X_4, X_5), \qquad \mathfrak{S}(Y_4) \subset \mathfrak{S}(X_4, X_5, X_6).$$

This can be stated by saying that the receptive field of $Y_1$ consists of the units $X_1, X_2, X_3$, unlike the case of a fully-connected layer, when the receptive field would consist of all previous neurons. Consequently, a convolution layer passes less information than a fully-connected layer.

Figure 16.3: Two equivalent convolution networks: a one-hidden layer convolutional net with stride 1 and support width 2; b two-layer convolutional net with stride 1 and support width 3.

On the other hand, the convolution layer represented in Fig. 16.2 b has stride $s = 2$ and uses only two shared weights, $w_1, w_2$ (the support width is 2). We consider two layers of this type as in Fig. 16.4 a. The convolutional net is equivalent to the two-layer net given in Fig. 16.4 b. The new net depends on four shared weights, $v_1, v_2, v_3, v_4$ (the support width is 4), with the stride 1. Both networks in Fig. 16.4 a, b satisfy the information relations

$$\mathfrak{S}(Y_1) \subset \mathfrak{S}(X_1, X_2, X_3, X_4), \qquad \mathfrak{S}(Y_2) \subset \mathfrak{S}(X_2, X_3, X_4, X_5).$$

In particular, the receptive field of $Y_1$ consists of $X_1, X_2, X_3, X_4$.

Figure 16.4: Two equivalent convolutional networks: a one-hidden layer convolutional net with stride equal to 2; b two-layer convolutional net with stride equal to 4.

Remark 16.4.1 The convolution compresses the information. If $s$ denotes the stride, then $d^{(\ell - 1)} = d^{(\ell)} + s$, i.e., at each layer the number of neurons decreases by the stride number. Since $s$ is a small number, the compression is smaller than in the case of pooling.


16.5 Convolutional Layers with 2-D Input 

Convolutional networks have shown excellent performance in the case of processing 2-dimensional images. In this case each input is a color image (in RGB format), which can be seen as a tensor of type $r \times c \times 3$, where $r$ is the number of pixel rows and $c$ the number of pixel columns in the image. This is equivalent to 3 channels of dimension $r \times c$, each channel corresponding to one color. Before getting any further, we shall first discuss the convolution of the spatial slice of an image with a given kernel.

In the following we shall convolve a $2 \times 2$ convolution kernel given in Fig. 16.5 a with a $3 \times 3$ input image, which is illustrated by the matrix in Fig. 16.5 b. The convolution kernel is overlapped on the matrix and moved horizontally and vertically, in all possible positions. In each position we sum the products of the kernel entries and the matrix entries, the number obtained being an output entry. The kernel overlap starts from the top left of the image and slides by one pixel to the right. Then we continue the operation for the bottom row, as follows:

$$\begin{aligned}
1 \cdot 2 - 1 \cdot 1 + 2 \cdot 4 + 1 \cdot 3 &= 12 \\
1 \cdot 1 - 1 \cdot 1 + 2 \cdot 3 + 1 \cdot 5 &= 11 \\
1 \cdot 4 - 1 \cdot 3 + 2 \cdot 7 + 1 \cdot 6 &= 21 \\
1 \cdot 3 - 1 \cdot 5 + 2 \cdot 6 + 1 \cdot 0 &= 10.
\end{aligned}$$

Using the convolution operator, $*$, the previous computation writes as

$$\begin{pmatrix} 2 & 1 & 1 \\ 4 & 3 & 5 \\ 7 & 6 & 0 \end{pmatrix} * \begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} 12 & 11 \\ 21 & 10 \end{pmatrix}.$$


Figure 16.5: The convolution operation: a A $2 \times 2$ kernel; b the input and output of a convolution.


We notice that the convolution between a $2 \times 2$ kernel and a $3 \times 3$ image is an output of size $2 \times 2$. In general, if the kernel is of size $h \times k$ and the image is $H \times K$, the output has the size $(H - h + 1) \times (K - k + 1)$. For the case of an arbitrary stride, see Exercise 16.9.11.
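The worked example can be reproduced with a direct implementation of the sliding sum. This is a minimal sketch with stride 1 and no padding; the function name `cross_correlate_2d` is a hypothetical choice.

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the entrywise products."""
    H, K = image.shape
    h, k = kernel.shape
    out = np.zeros((H - h + 1, K - k + 1), dtype=image.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + k] * kernel)
    return out

kernel = np.array([[1, -1],
                   [2,  1]])
image = np.array([[2, 1, 1],
                  [4, 3, 5],
                  [7, 6, 0]])

print(cross_correlate_2d(image, kernel))
# [[12 11]
#  [21 10]]
```

The output has shape $(3 - 2 + 1) \times (3 - 2 + 1) = 2 \times 2$, in agreement with the size formula above.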

In the previous example the spatial slice of the image has been convolved with a $2 \times 2$ kernel. The output of this operation is also called a feature map, since this procedure can detect different sorts of features, depending on the kernel used, such as horizontal or vertical edges, corners, etc., as the reader can see in the exercise section.

Each color channel can be convolved separately with the same or a different kernel. It is possible, however, to convolve the whole 3-channel color image with a tensor of order three, i.e., with a sequence of 3 kernels of the same type. In this case the convolution operation is defined in a similar way. The kernel is overlapped on the image at the location (0, 0, 0) and then we take the sum of the products of their entries to obtain the first output. Then the kernel is moved by one pixel from top to bottom and then from left to right to complete the operation.

For a fixed kernel, the entries of the tensor $X^{(\ell)}$ are addressed by the triple indexed entry $x^{(\ell)}_{ijk}$, with $1 \le i \le r^{(\ell)}$, $1 \le j \le c^{(\ell)}$, and $1 \le k \le 3$, see Fig. 16.6. One feature map in the $\ell$th layer, for the $k$th channel and corresponding to the given kernel $w^{(\ell)}$ and bias $b^{(\ell)}$, is given by

$$x^{(\ell)}_{ijk} = \phi\Big(\sum_{p} \sum_{r} x^{(\ell - 1)}_{i+p, \, j+r, \, k} \, w^{(\ell)}_{prk} + b^{(\ell)}\Big),$$

where the activation function $\phi$ is usually taken to be a ReLU in order to avoid vanishing gradients.

Since $k$ represents the color channel, then $1 \le k \le 3$. The $\ell$th layer of the CNN, $X^{(\ell)}$, is given by the collection of all tensors of order 3 of the form $x^{(\ell)}_{ijk}$ corresponding to all kernels $w^{(\ell)}$. If $f^{(\ell)}$ denotes the number of feature maps in the $\ell$th layer (the number of kernels used for that layer), and $r^{(\ell)} \times c^{(\ell)}$ are the dimensions of the image in the $\ell$th layer (rows times columns), then we can write the output of the $\ell$th layer as an order 4 tensor $X^{(\ell)} \in \mathbb{R}^{r^{(\ell)} \times c^{(\ell)} \times 3 \times f^{(\ell)}}$.

Figure 16.6: The entries of the tensor $X^{(\ell)} \in \mathbb{R}^{r^{(\ell)} \times c^{(\ell)} \times 3}$. The entry $x^{(\ell)}_{ijk}$ represents the activation in the spatial location $(i, j)$ situated in the $k$th channel.

Customarily, the sequences $r^{(\ell)}$ and $c^{(\ell)}$ are decreasing with respect to $\ell$. This is due to the fact that processing with convolution layers tends to decrease the image dimensions by a number equal to the stride.$^1$ (Besides this, using pooling between convolution layers also compresses the dimension by a certain factor.) The number of channels remains the same, but the number of features, $f^{(\ell)}$, increases with $\ell$.

If the input layer is denoted by $X^{(0)}$, then the first hidden layer, $X^{(1)}$, contains features of the layer $X^{(0)}$. The second hidden layer, $X^{(2)}$, contains features of $X^{(1)}$, i.e., features on features of the input. In general, each layer contains features of the previous layer. At the end, a fully-connected layer is introduced to put together all the information from the last convolution layer, which contains a large number of features. Then a softmax layer can be employed if the network is meant for classification purposes, see Fig. 16.7.


1 In order to avoid this dimension reduction, one can use the trick of padding with zeros. 



























Figure 16.7: CNN with two convolution layers, one fully-connected layer and a softmax layer (output layer).


16.6 Geometry of CNNs 

In Chapter 13 we have associated a manifold with each neural network. The network weights and biases were coordinate systems on the associated manifold. Its dimension is equal to the number of network parameters, which can be expressed in terms of layer sizes by formula (13.3.8)

$$d^{(0)} d^{(1)} + d^{(1)} d^{(2)} + \cdots + d^{(\ell - 1)} d^{(\ell)} + \cdots + d^{(L-1)} d^{(L)} + N. \qquad (16.6.2)$$

The fact that in a CNN the weights and biases are shared among neurons substantially reduces the total number of parameters of the network, and consequently, the dimension of the associated output manifold. This fact has regularization consequences, and hence the CNN is usually not prone to overfitting the training data.

We shall work the comparison on two examples at hand. In the case of the CNN shown in Fig. 16.3 a, the dimension of the associated neural manifold is given by 2 + 2 + 5 + 4 = 13 (four weights and nine biases). If, keeping the same number of neurons in the layers, this network is replaced by a fully-connected net, then formula (13.3.8) yields the dimension 6 · 5 + 5 · 4 + 5 + 4 = 59.

As a second example, consider the CNN shown in Fig. 16.4 a. Then the dimension of the associated neural manifold is given by 2 + 2 + 4 + 2 = 10 (four weights and six biases). The associated fully-connected network has an associated manifold with the dimension 6 · 4 + 4 · 2 + 4 + 2 = 38. We note that in both cases the dimension of the neural manifold is substantially larger in the case of a fully-connected network than in the case of a similar CNN. The same phenomenon holds in general for all CNN networks.
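The two parameter counts above can be reproduced with a short computation. The sketch follows the counting convention used in the two examples (a shared kernel of width 2 per layer and one bias per neuron); the function names are hypothetical.

```python
def fully_connected_params(layer_sizes):
    """Weights between consecutive layers plus one bias per non-input neuron."""
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

def conv_params(layer_sizes, kernel_width):
    """Counting used in the examples: `kernel_width` shared weights per layer, one bias per neuron."""
    n_layers = len(layer_sizes) - 1
    return kernel_width * n_layers + sum(layer_sizes[1:])

# Layer sizes 6-5-4 (stride 1, Fig. 16.3 a) and 6-4-2 (stride 2, Fig. 16.4 a).
print(conv_params([6, 5, 4], 2), fully_connected_params([6, 5, 4]))   # 13 59
print(conv_params([6, 4, 2], 2), fully_connected_params([6, 4, 2]))   # 10 38
```

If the biases were also shared within each layer, the convolutional counts would be smaller still.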


































16.7 Equivariance and Invariance 

Convolutional neural networks are able to detect local patterns in an image regardless of their position. This is because CNNs ensure equivariance to translations, see Proposition 16.3.3. This means that if the input image is translated by some vector, then the activation pattern in each higher layer of the network is also translated by the same vector. Therefore, a key component of the success of CNNs in image recognition tasks is their equivariance property.

The next level of abstraction is to replace the set of translations by any group of transformations of the input and explain the symmetries of the network parameters by the network equivariance with respect to the considered transformations.

Applications of group theory in neural networks can be found in Ravanbakhsh et al. [100], Kondor and Trivedi [66], Kondor [65], Cohen et al. [26, 27], and Bartok et al. [13]. In the following we shall use ideas from the aforementioned papers to briefly discuss this newly emerged direction of research.

16.7.1 Groups 

The next definition introduces an algebraic structure that will be useful shortly.

Definition 16.7.1 A group is a set $G$ endowed with a composition law $G \times G \to G$ denoted multiplicatively and satisfying the following properties:

(i) $g_1 g_2 \in G$, for any $g_1, g_2 \in G$;

(ii) $g_1 (g_2 g_3) = (g_1 g_2) g_3$, for any $g_1, g_2, g_3 \in G$;

(iii) there is a unique element $e \in G$ such that $e x = x e = x$ for any $x \in G$;

(iv) for any $g \in G$ there is $g^{-1} \in G$ such that $g g^{-1} = g^{-1} g = e$.

Property (i) states that $G$ is closed with respect to the group law, while (ii) says that the group law is associative; (iii) indicates the existence of the neutral element in the group; the existence of the inverse element is given by (iv).

If the order of elements in the group law composition does not matter, $g_1 g_2 = g_2 g_1$ for any $g_1, g_2 \in G$, then the group $G$ is called commutative. Depending on the number of elements, the group $G$ can be finite or infinite.

Any subset $H$ of $G$, which forms a group under the same law as $G$, is called a subgroup and is denoted by $H \le G$.

Example 16.7.2 The set of integers, $G = \mathbb{Z}$, endowed with the addition operation forms a commutative group. The inverse of $n$ is $-n$ and the neutral element is $e = 0$. Similarly, the integer lattice, $G = \mathbb{Z} \times \mathbb{Z}$, with addition on components

$$(n_1, n_2) + (m_1, m_2) = (n_1 + m_1, n_2 + m_2)$$

also forms a commutative group. Its neutral element is $(0, 0)$.

The set $H = 3\mathbb{Z} = \{3m; m \in \mathbb{Z}\}$ forms a subgroup of $\mathbb{Z}$, while $K = \{(2i, 2j); i, j \in \mathbb{Z}\}$ forms a subgroup of $\mathbb{Z} \times \mathbb{Z}$.

Example 16.7.3 Let $v \in \mathbb{R}^3$ be a vector and define the translation $\tau_v : \mathbb{R}^3 \to \mathbb{R}^3$ by $\tau_v(x) = x + v$. The set $G = T(\mathbb{R}^3) = \{\tau_v; v \in \mathbb{R}^3\}$ forms a group with respect to function composition, called the translations group on $\mathbb{R}^3$. We have $\tau_v \circ \tau_u = \tau_{v+u}$ and $(\tau_v)^{-1} = \tau_{-v}$. The neutral element is $\tau_0 = Id$, the identity transformation of $\mathbb{R}^3$.


Example 16.7.4 The $2 \times 2$ matrix

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

represents a counterclockwise rotation of the plane $\mathbb{R}^2$ about the origin. The set of these rotations, $SO(2) = \{R_\theta; \theta \in \mathbb{R}\}$, forms a group under matrix multiplication, called the special orthogonal group of $\mathbb{R}^2$. We can easily check that $R_\theta R_{\theta'} = R_{\theta'} R_\theta = R_{\theta + \theta'}$ and $R_\theta^{-1} = R_{-\theta}$.

The set $H = \{R_0, R_\pi\}$, formed by the identity transform and the 180-degree rotation (or flip about the origin), forms a subgroup of $SO(2)$.

Also, the set $K = \{R_0, R_{\pi/2}, R_\pi, R_{3\pi/2}\}$ forms another subgroup of $SO(2)$, formed by the 90-degree rotations about the origin.

Example 16.7.5 The set $G = \mathbb{R}^3$ together with the composition law

$$(x_1, x_2, x_3) \circ (y_1, y_2, y_3) = (x_1 + y_1, x_2 + y_2, x_3 + y_3 + x_1 y_2)$$

forms a group, called the three-dimensional Heisenberg group. This is not a commutative group. The inverse of an element is given by

$$(x_1, x_2, x_3)^{-1} = (-x_1, -x_2, -x_3 + x_1 x_2).$$

The neutral element is $e = (0, 0, 0)$.
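The group properties of the Heisenberg law can be spot-checked numerically. The sketch below is not a proof: it only tests the axioms on a small grid of integer triples, and the function names `compose` and `inverse` are hypothetical.

```python
import itertools

def compose(x, y):
    """Heisenberg law: (x1 + y1, x2 + y2, x3 + y3 + x1 * y2)."""
    return (x[0] + y[0], x[1] + y[1], x[2] + y[2] + x[0] * y[1])

def inverse(x):
    """Inverse element: (-x1, -x2, -x3 + x1 * x2)."""
    return (-x[0], -x[1], -x[2] + x[0] * x[1])

E = (0, 0, 0)
samples = list(itertools.product(range(-2, 3), repeat=3))   # 125 integer triples
subset = samples[::7]                                        # smaller set for triple products

associative = all(
    compose(compose(a, b), c) == compose(a, compose(b, c))
    for a, b, c in itertools.product(subset, repeat=3)
)
inverses_ok = all(
    compose(x, inverse(x)) == E and compose(inverse(x), x) == E for x in samples
)

print(associative, inverses_ok)                         # True True
print(compose((1, 0, 0), (0, 1, 0)))                    # (1, 1, 1)
print(compose((0, 1, 0), (1, 0, 0)))                    # (1, 1, 0)
```

The last two lines exhibit a pair of elements that do not commute, which confirms that the group is not commutative.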


16.7.2 Actions of groups on sets 

Definition 16.7.6 Let $G$ be a group and $M$ be a set. An action of $G$ on $M$ is a mapping $\alpha : G \times M \to M$ such that:

(i) $\alpha(g g', x) = \alpha(g, \alpha(g', x))$, $\forall g, g' \in G$ and $x \in M$;

(ii) $\alpha(e, x) = x$, $\forall x \in M$.





We say that the group $G$ acts on the set $M$ with action $\alpha$, and the action of the element $g$ on $x$ is $\alpha(g, x)$. Thus, part (i) states that the action of the product $g g'$ on $x$ is the composition of the action of $g$ with the action of $g'$ on $x$. Part (ii) states that the neutral element action is the identity map.

Any action of a group $G$ on the set $M$, given by $\alpha : G \times M \to M$, produces a family of transformations of $M$ as follows. For any fixed $g \in G$, let $T_g : M \to M$ be defined by $T_g x = \alpha(g, x)$. The set of these transformations, $\{T_g; g \in G\}$, forms a group under function composition. This follows from the properties of the action, which imply $T_g T_{g'} = T_{g g'}$ and $T_e x = x$. We also have the inverse transformation $(T_g)^{-1} = T_{g^{-1}}$.

For a given element $x \in M$, the set $O_x = \{T_g x; g \in G\}$ is called the orbit of $x$. If $y, z \in O_x$ are two elements in the orbit of $x$, then there are $g, g' \in G$ such that $y = T_g x$ and $z = T_{g'} x$. If we let $u = g' g^{-1}$, then $z = T_u y$, namely, $z \in O_y$. In fact, it can be shown that $O_x = O_y = O_z$.

Definition 16.7.7 An action $\alpha : G \times M \to M$ is called transitive if for any
two elements $x, y \in M$, there is $g \in G$ such that $y = T_g x$.

Equivalently stated, the action $\alpha$ is transitive if for any $x, y \in M$ we have
$O_x = O_y$. In fact, the action $\alpha$ is transitive if and only if it has only one
orbit, namely, $M = O_x$ for all $x \in M$.

We say that $M$ is a homogeneous space of $G$ if for any $x, y \in M$ there is a
$g \in G$ such that $y = \alpha(g, x)$. This is equivalent to the fact that the action
has only one orbit.

Example 16.7.8 Let $M = \mathbb{R}^3$ and $G = T(\mathbb{R}^3)$, the group of translations in
the space $\mathbb{R}^3$ under the operation of vector addition. Then $G$ acts on $M$ as
follows: if $x \in \mathbb{R}^3$ is a vector and $g = \tau_v$ is the translation by the vector $v$, then we
define the action of $g$ on $x$ by $\alpha(g, x) = \tau_v(x)$, which is the vector obtained
from $x$ by adding the vector $v$. More explicitly, we have

$$\alpha(g, x) = \tau_v(x) = x + v.$$

The reader can easily check the properties in the definition of an action. The action is
transitive, as any element $x \in \mathbb{R}^3$ can be translated into any other element
$y \in \mathbb{R}^3$.

Example 16.7.9 Consider $M = \mathbb{R}^2$ and $G = SO(2)$, the group of rotations
about the origin of the two-dimensional Euclidean plane $\mathbb{R}^2$ under the operation
of composition. The group $G$ acts on the set $M$ as follows: if $x \in \mathbb{R}^2$
is a vector and $g = R_\theta$ is the rotation of angle $\theta$, then we define the action
of $g$ on $x$ by $\alpha(g, x) = R_\theta x$, which is the vector obtained from $x$ under a
counterclockwise rotation of angle $\theta$. More explicitly, we have

$$\alpha(g, x) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 \cos\theta - x_2 \sin\theta \\ x_1 \sin\theta + x_2 \cos\theta \end{pmatrix}.$$

This action is not transitive. The orbit of an element $x \in \mathbb{R}^2$ is the circle
centered at the origin and having radius $\|x\|$:

$$O_x = \{y \in \mathbb{R}^2 ;\ \|y\| = \|x\|\}.$$



Example 16.7.10 In this example $M = \mathbb{R}^3$ and $G = (\mathbb{R}^3, \circ)$ represents the
three-dimensional Heisenberg group. Then $G$ acts on $M$ as

$$\alpha(g, x) = (g_1 + x_1,\ g_2 + x_2,\ g_3 + x_3 + g_1 x_2),$$

where $g = (g_1, g_2, g_3)$ and $x = (x_1, x_2, x_3)$. The easiest way to check that $\alpha$
is an action is to notice that

$$\alpha(g, x) = L_g x, \qquad (16.7.3)$$

where $L_g$ is the left translation on the group $(G, \circ)$, namely, $L_g x = g \circ x$.
Then

$$\alpha(g', \alpha(g, x)) = \alpha(g', L_g x) = L_{g'} L_g x = L_{g'g} x = \alpha(g'g, x).$$

The second property of the action is obviously satisfied, since the neutral
element in the Heisenberg group is $e = (0, 0, 0)$. It is worth noting that an
action of a group on itself, $\alpha : G \times G \to G$, defined by relation (16.7.3) is
called the action by left multiplication. The action of the Heisenberg group on
the space $\mathbb{R}^3$ is transitive, since for any $x, y \in \mathbb{R}^3$ we have $T_g x = y$, with
$g = (y_1 - x_1,\ y_2 - x_2,\ y_3 - x_3 + x_1 x_2 - y_1 x_2)$.

The last example brings up the situation when a group $G$ acts on the set
$M = G$, which is the group itself; in this case the action is a map $\alpha : G \times G \to G$.
If we define $\alpha(g, x) = x g^{-1}$, then $\alpha$ is called the action by right multiplication.
If we consider $\alpha(g, x) = g x g^{-1}$, then $\alpha$ is called the action by conjugation.


16.7.3 Extension of actions to functions 

We have seen how an action $\alpha : G \times M \to M$ induces, for each group
element $g \in G$, a transformation $T_g : M \to M$. Now we shall extend
this transformation to functions on $M$. We denote by $\mathcal{F}(M) = \{f : M \to \mathbb{R}\}$
the set of real functions on $M$. For any element $g \in G$ we define the transform
$\mathbb{T}_g : \mathcal{F}(M) \to \mathcal{F}(M)$ by $\mathbb{T}_g f = f'$, with

$$f'(T_g(x)) = f(x), \qquad x \in M.$$

This can be written equivalently as

$$(\mathbb{T}_g f)(x') = f(T_{g^{-1}}(x')), \qquad \forall x' \in M.$$

We need this concept since the activations of each layer of a neural network
are considered as functions. For instance, each MNIST image is considered as a
function defined on a grid $M$ of $28 \times 28$ pixels, with integer values between
0 and 255.

Example 16.7.11 Let $\mathbb{Z}$ represent the set of integers and consider the group
$G = (\mathbb{Z} \times \mathbb{Z}, +)$ with addition on components. This acts on the lattice $M = \mathbb{Z} \times \mathbb{Z}$
by the action $\alpha : G \times M \to M$

$$\alpha((g_1, g_2), (x_1, x_2)) = (g_1 + x_1,\ g_2 + x_2), \qquad \forall g_i, x_i \in \mathbb{Z}.$$

The induced transform on $M$ is

$$T_{(g_1, g_2)}(x_1, x_2) = (g_1 + x_1,\ g_2 + x_2), \qquad \forall (x_1, x_2) \in \mathbb{Z} \times \mathbb{Z}.$$

The associated extended transform on $\mathcal{F}(\mathbb{Z} \times \mathbb{Z})$ is given by

$$(\mathbb{T}_{(g_1, g_2)} f)(x_1, x_2) = f\big(T_{(g_1, g_2)}^{-1}(x_1, x_2)\big) = f(x_1 - g_1,\ x_2 - g_2),$$

which is the composition between the function $f$ and the translation of vector
$-(g_1, g_2)$.
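The extended transform of Example 16.7.11 can be sketched in a few lines of code, treating an image as a function on the lattice. This is only an illustration under stated assumptions: the helper `translate`, the sparse dictionary `pixels`, and the convention that unlisted pixels have value 0 are hypothetical choices, not part of the text.

```python
def translate(f, g):
    """Return the extended transform T_g f, with (T_g f)(x) = f(x - g) on Z x Z."""
    g1, g2 = g
    return lambda x1, x2: f(x1 - g1, x2 - g2)

# Hypothetical tiny "image": nonzero pixel values stored sparsely, 0 elsewhere.
pixels = {(0, 0): 255, (1, 0): 128, (0, 1): 64}

def image(x1, x2):
    return pixels.get((x1, x2), 0)

# Translating the function by g = (2, 3) moves the pattern to start at (2, 3).
shifted = translate(image, (2, 3))

# Two successive translations agree with the translation by the sum of the vectors.
twice = translate(translate(image, (1, 1)), (1, 2))
```

Note that the function is evaluated at $x - g$, so the pattern itself moves by $+g$; this is the role of the inverse $T_{g^{-1}}$ in the definition.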

Example 16.7.12 In the case of Example 16.7.9, the induced transform on
$\mathbb{R}^2$ is given by

$$T_{R_\theta}(x) = (x_1 \cos\theta - x_2 \sin\theta,\ x_1 \sin\theta + x_2 \cos\theta),$$

which is a counterclockwise rotation of angle $\theta$. The extended transform on functions
is

$$(\mathbb{T}_{R_\theta} f)(x) = f(T_{R_\theta^{-1}} x) = f(x_1 \cos\theta + x_2 \sin\theta,\ x_2 \cos\theta - x_1 \sin\theta).$$

16.7.4 Definition of equivariance

Consider a group $G$ acting on two sets, $M_1$ and $M_2$, with actions

$$\alpha_1 : G \times M_1 \to M_1, \qquad \alpha_2 : G \times M_2 \to M_2.$$

Then for any $g \in G$ the actions induce the transformations

$$T_g : M_1 \to M_1, \qquad T'_g : M_2 \to M_2.$$

Consider the extended transforms to functions

$$\mathbb{T}_g : \mathcal{F}(M_1) \to \mathcal{F}(M_1), \qquad \mathbb{T}'_g : \mathcal{F}(M_2) \to \mathcal{F}(M_2).$$

A map $\Phi : \mathcal{F}(M_1) \to \mathcal{F}(M_2)$ is called G-equivariant if for any group element
$g \in G$ we have

$$\Phi(\mathbb{T}_g f) = \mathbb{T}'_g(\Phi(f)), \qquad \forall f \in \mathcal{F}(M_1).$$

These concepts can be applied to neural networks as follows. Let
$M_1$ and $M_2$ be the sets of neurons in the input and output layers of
a feedforward neural network, respectively. The activations of these layers,
which are the input $x^{(0)}$ and the output $x^{(L)}$, are functions defined on $M_1$
and $M_2$, respectively. The function $\Phi = f_{w,b}$, which maps $x^{(0)}$ into $x^{(L)}$, is
the input-output mapping of the network. The equivariance property of the
network with respect to the action of the group $G$ becomes

$$f_{w,b}(\mathbb{T}_g x^{(0)}) = \mathbb{T}'_g\big(f_{w,b}(x^{(0)})\big) = \mathbb{T}'_g(x^{(L)}), \qquad \forall g \in G. \qquad (16.7.4)$$

This states that the output $x^{(L)}$ is transformed in a predictable way as we
transform the input $x^{(0)}$ within the family of transforms induced by the group
action.

The equivariance of a network can be defined on each layer as follows,
see Kondor [66]. Let $\mathcal{N}$ be a feedforward neural network with $L + 1$
layers and layer activations $x^{(0)}, \dots, x^{(L)}$. If $M_\ell$ denotes the set of neurons
in the $\ell$th layer, then $x^{(\ell)} \in \mathcal{F}(M_\ell)$. Assume there is a group $G$ which acts
on the sets $M_0, \dots, M_L$. The induced transforms on $\mathcal{F}(M_0), \dots, \mathcal{F}(M_L)$ are
correspondingly denoted by $\mathbb{T}^{(0)}, \dots, \mathbb{T}^{(L)}$. The neural network $\mathcal{N}$ is called a
G-equivariant feedforward network if, when inputs are transformed by $x^{(0)} \to \mathbb{T}_g^{(0)}(x^{(0)})$,
then the layer activations transform by $x^{(\ell)} \to \mathbb{T}_g^{(\ell)}(x^{(\ell)})$ for any
$g \in G$.

It is worth noting that this definition holds in both cases, when the sets
of neurons, $M_\ell$, are either discrete or continuous.

16.7.5 Convolution and equivariance 

The prototype example for equivariance is given by convolutional networks,
which are equivariant to the actions of the translations group. With the
previous notations this means

$$(\mathbb{T}_g x^{(0)}) * w = \mathbb{T}'_g(x^{(0)} * w), \qquad \forall g \in G,$$

for any filter $w$, see Fig. 16.8. This relation can be shown in both discrete
and continuous cases. For the sake of simplicity we consider just the
one-dimensional case. The discrete case follows from

$$[(\mathbb{T}_g x^{(0)}) * w]_p = \sum_i (\mathbb{T}_g x^{(0)})_{i+p}\, w_i = \sum_i x^{(0)}_{i+p-g}\, w_i = (x^{(0)} * w)_{p-g} = [\mathbb{T}'_g(x^{(0)} * w)]_p.$$

The verification of equivariance in the continuous case is given by

$$[(\mathbb{T}_g x^{(0)}) * w](t) = \int_{\mathbb{R}} (\mathbb{T}_g x^{(0)})(u + t)\, w(u)\, du = \int_{\mathbb{R}} x^{(0)}(u + t - g)\, w(u)\, du = (x^{(0)} * w)(t - g) = [\mathbb{T}'_g(x^{(0)} * w)](t).$$
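The discrete computation above can be checked numerically. The sketch below assumes periodic boundary conditions (indices taken modulo the signal length), which is a simplification not made in the text, so that translations are exact on a finite signal; the names `conv` and `shift` are illustrative.

```python
def conv(x, w):
    """(x * w)_p = sum_i x_{i+p} w_i, with indices taken modulo len(x) (assumption)."""
    n = len(x)
    return [sum(x[(i + p) % n] * w[i] for i in range(len(w))) for p in range(n)]

def shift(x, g):
    """Extended translation: (T_g x)_i = x_{i-g}, again modulo len(x)."""
    n = len(x)
    return [x[(i - g) % n] for i in range(n)]

x = [1, 4, 2, 7, 0, 3]   # arbitrary signal
w = [1, -2, 1]           # arbitrary filter

# Translating then convolving equals convolving then translating, for every shift g.
equivariant = all(conv(shift(x, g), w) == shift(conv(x, w), g) for g in range(len(x)))
```

With integer data the two sides agree exactly, so no floating-point tolerance is needed.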

The previous computations can be carried over to groups as follows.
First, the convolution of any two functions $f, \psi : G \to \mathbb{R}$ defined on a discrete
group $G$ is defined by

$$(f * \psi)(t) = \sum_{y \in G} f(y)\, \psi(t^{-1} y).$$

The verification of the G-equivariance relation is similar to the previous
computation:

$$(\mathbb{T}_g x^{(0)} * w)(t) = \sum_{y \in G} (\mathbb{T}_g x^{(0)})(y)\, w(t^{-1} y) = \sum_{y \in G} x^{(0)}(g^{-1} y)\, w(t^{-1} y)$$

$$= \sum_{v \in G} x^{(0)}(v)\, w(t^{-1} g v) = \sum_{v \in G} x^{(0)}(v)\, w\big((g^{-1} t)^{-1} v\big) = (x^{(0)} * w)(g^{-1} t) = \big(\mathbb{T}_g(x^{(0)} * w)\big)(t),$$

where we used the change of variables $v = g^{-1} y$ and the fact that the variable
$v \in g^{-1} G = G$.
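To see that the argument does not rely on commutativity, the relation can be tested on a small non-commutative group. The sketch below uses the symmetric group on three letters, with permutations stored as tuples; the group, the helper names, and the sample values of the signal and filter are all illustrative assumptions.

```python
from itertools import permutations

G = list(permutations(range(3)))          # the six elements of S_3

def mul(a, b):
    """Composition of permutations: (a b)(i) = a(b(i))."""
    return tuple(a[b[i]] for i in range(3))

def inv(a):
    """Inverse permutation."""
    r = [0, 0, 0]
    for i, ai in enumerate(a):
        r[ai] = i
    return tuple(r)

def act(g, f):
    """Extended left translation: (T_g f)(y) = f(g^{-1} y)."""
    return {y: f[mul(inv(g), y)] for y in G}

def gconv(f, w):
    """Group convolution: (f * w)(t) = sum_y f(y) w(t^{-1} y)."""
    return {t: sum(f[y] * w[mul(inv(t), y)] for y in G) for t in G}

x = {p: k + 1 for k, p in enumerate(G)}          # arbitrary integer signal on G
w = {p: (k * k) % 5 for k, p in enumerate(G)}    # arbitrary integer filter on G

equivariant = all(gconv(act(g, x), w) == act(g, gconv(x, w)) for g in G)
```

The check passes for every $g$ because the substitution $v = g^{-1} y$ only permutes the terms of the sum.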


The equivariance theory can be extended to continuous compact groups,
see [66]. The convolution of any two functions $f, \psi : G \to \mathbb{R}$ in this
case is defined by

$$(f * \psi)(t) = \int_G f(y)\, \psi(t^{-1} y)\, d\mu(y),$$

where $\mu$ is a left-translation invariant measure on $G$, with $\mu(G) = 1$, called the
Haar measure on $G$. The verification of the equivariance is done similarly to
the discrete case, replacing the sums by integrals and using the invariance
property of the Haar measure:

$$(\mathbb{T}_g x^{(0)} * w)(t) = \int_G (\mathbb{T}_g x^{(0)})(y)\, w(t^{-1} y)\, d\mu(y) = \int_G x^{(0)}(g^{-1} y)\, w(t^{-1} y)\, d\mu(y)$$

$$= \int_G x^{(0)}(v)\, w(t^{-1} g v)\, d\mu(v) = \int_G x^{(0)}(v)\, w\big((g^{-1} t)^{-1} v\big)\, d\mu(v) = \big(\mathbb{T}_g(x^{(0)} * w)\big)(t).$$






Figure 16.8: The equivariance can be seen as a commutative diagram.


It is worth noting that Cohen et al. [26] applied the technique for analyzing
spherical images. They adapted the definition of convolution from planar
domains to the sphere $S^2 = \{x \in \mathbb{R}^3 ;\ \|x\| = 1\}$ by

$$(f * \psi)(R) = \int_{S^2} f(x)\, \psi(R^{-1} x)\, dx,$$

where $\psi, f : S^2 \to \mathbb{R}$ are two spherical signals, and $R \in SO(3)$ is a rotation.
They used the action of the special orthogonal group $SO(3)$ (namely, the
group of $3 \times 3$ matrices that preserve distance and have determinant 1) on
the sphere $S^2$ to prove the rotation equivariance.

Even if a planar convolution is always equivariant to the actions of the
translations group, it is not covariant with respect to the rotations group
$SO(2)$, unless some additional assumptions are made. This is worked out
in the following.

Let $R \in SO(2)$ be a rotation of the plane and $f, w : \mathbb{R}^2 \to \mathbb{R}$. Then

$$[(\mathbb{T}_R f) * w](x) = \sum_{y \in \mathbb{Z}^2} (\mathbb{T}_R f)(y)\, w(y - x) = \sum_{y \in \mathbb{Z}^2} f(R^{-1} y)\, w(y - x)$$

$$= \sum_{u \in \mathbb{Z}^2} f(u)\, w(Ru - x) = \sum_{u \in \mathbb{Z}^2} f(u)\, w\big[R(u - R^{-1} x)\big]$$

$$= \sum_{u \in \mathbb{Z}^2} f(u)\, (\mathbb{T}_{R^{-1}} w)(u - R^{-1} x)$$

$$= [f * \mathbb{T}_{R^{-1}} w](R^{-1} x) = \mathbb{T}_R(f * \mathbb{T}_{R^{-1}} w)(x).$$

If now we consider $f$ to be a signal and $w$ a filter, and further assume that $w$
is a rotation-invariant filter, namely, $\mathbb{T}_{R^{-1}} w = w$, then the previous formula
becomes

$$[(\mathbb{T}_R f) * w](x) = \mathbb{T}_R(f * \mathbb{T}_{R^{-1}} w)(x) = \mathbb{T}_R(f * w)(x),$$

which represents the covariance of the planar convolution with respect to
rotations for rotation-invariant filters.













In the case of the group $G = \mathbb{Z}^2$, we consider the rotations $R$ to be the 90,
180 and 270-degree rotations of the lattice $\mathbb{Z}^2$ about the origin. The rotation
invariance of the filter, $w(Rx) = w(x)$, in this case becomes

$$w_{i,j} = w_{-j,i}, \qquad \forall (i, j) \in \mathbb{Z}^2.$$
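The role of the rotation-invariance assumption can be illustrated on the lattice with the 90-degree rotation $R(i, j) = (-j, i)$. The following sketch uses a signal with finite support and two hypothetical filters, one depending only on $i^2 + j^2$ (hence rotation invariant) and one that is not; all names and values are illustrative.

```python
def rot_inv(p):
    """Inverse of the 90-degree rotation R(i, j) = (-j, i)."""
    return (p[1], -p[0])

def rotate_signal(f):
    """Extended transform T_R f, stored sparsely: (T_R f)(R y) = f(y)."""
    return {(-y[1], y[0]): v for y, v in f.items()}

def conv(f, w, x):
    """(f * w)(x) = sum_y f(y) w(y - x), summed over the finite support of f."""
    return sum(v * w((y[0] - x[0], y[1] - x[1])) for y, v in f.items())

f = {(0, 0): 1, (1, 0): 2, (2, 1): -1}          # arbitrary signal, finite support

def w_inv(p):
    """Filter depending only on i^2 + j^2, so w(Rp) = w(p)."""
    return {0: 4, 1: -1}.get(p[0] ** 2 + p[1] ** 2, 0)

def w_plain(p):
    """A filter that is not rotation invariant."""
    return p[0]

window = [(a, b) for a in range(-4, 5) for b in range(-4, 5)]
rf = rotate_signal(f)

# [(T_R f) * w](x) == (f * w)(R^{-1} x) holds for the invariant filter ...
holds = all(conv(rf, w_inv, x) == conv(f, w_inv, rot_inv(x)) for x in window)
# ... and fails somewhere for the non-invariant one.
fails = any(conv(rf, w_plain, x) != conv(f, w_plain, rot_inv(x)) for x in window)
```

The second check shows that without the invariance of the filter the relation breaks already at the origin.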

16.7.6 Definition of invariance

Invariance is a particular case of equivariance. In this case the network
output does not change when the input is transformed by the family of
transformations $\{\mathbb{T}_g \mid g \in G\}$, and formula (16.7.4) becomes

$$f_{w,b}(\mathbb{T}_g x^{(0)}) = f_{w,b}(x^{(0)}) = x^{(L)}, \qquad \forall g \in G.$$

This means $\mathbb{T}'_g = Id$ for all $g \in G$, where $Id$ is the identity map of $\mathcal{F}(M_L)$.
Then for any $f \in \mathcal{F}(M_L)$

$$(\mathbb{T}'_g f)(x) = f(T'_{g^{-1}}(x)) = f(x), \qquad \forall x \in M_L.$$

Therefore, $T'_g(x) = x$ for all $x \in M_L$ and $g \in G$. This is equivalent to the fact
that the orbit of each element in $M_L$ is the element itself, namely, $O_x = \{x\}$
for all $x \in M_L$.

A prototype example of network invariance to local translation occurs in
the case of pooling, see section 15.2. We shall discuss next the pooling process
in the context of groups.

Let $G$ be a group and $f : G \to \mathbb{R}$ be a feature map.² Let $U \subset G$ be
a subset of $G$ which contains the neutral element, $e \in U$, called the pooling
domain. The max pooling operator, $\mathcal{P}$, is defined by $\mathcal{P}f : G \to \mathbb{R}$,

$$(\mathcal{P}f)(x) = \max_{u \in xU} f(u).$$

A distinguished case is the one when the pooling domain $U$ is a subgroup
$H \le G$. Then the pooling domains $\{xH ;\ x \in G\}$ form a partition of the
group $G$, namely, if $x, y \in G$, then $xH = yH$ or $xH \cap yH = \emptyset$. Furthermore,
if $G$ is a finite group, then $H$ and $xH$ have the same number of elements,
see Exercise 16.9.12. The partition sets $\{xH ;\ x \in G\}$ are called cosets and
represent the equivalence classes of the following equivalence relation on $G$:

$$x \sim y \iff x^{-1} y \in H.$$

² If $f : M \to \mathbb{R}$, then $f$ is an activation map, since it describes the activations of neurons
in the set $M$. However, if $f : G \to \mathbb{R}$, then $f$ is a feature map, since it describes the features
captured by the group $G$.






Pooling induces a map $\phi$ on the coset space defined by

$$\phi(xH) = (\mathcal{P}f)(x).$$


Example 16.7.13 This is a classical example of max pooling using a $2 \times 2$
pooling domain, which shifts across the two-dimensional lattice of integers.
For this, let $G = \mathbb{Z} \times \mathbb{Z}$ be a group with addition on components and consider
the activation $f : G \to \mathbb{R}$. Consider the pooling domain

$$U = \{(i, j) ;\ -2 < i < 2,\ -2 < j < 2\}.$$

The pooling becomes $(\mathcal{P}f)(x) = \max_{u \in x + U} f(u)$, where

$$x + U = \{(i, j) ;\ -2 + x_1 < i < 2 + x_1,\ -2 + x_2 < j < 2 + x_2\}$$

is the translation of the domain $U$ with the vector $(x_1, x_2)$, $x_i \in \mathbb{Z}$.
If the pooling domain is a subgroup of $G$,

$$H = \{(3 n_1, 3 n_2) ;\ (n_1, n_2) \in \mathbb{Z} \times \mathbb{Z}\},$$

then the maximum is taken over the shifts of $x$ by multiples of 3 pixels, which
corresponds to a stride of $s = 3$ pixels.


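The max pooling invariance in the groups context is shown in the next
result, [27].

Proposition 16.7.14 Pooling commutes with the group action:

$$\mathcal{P}(\mathbb{T}_g f)(x) = \mathbb{T}_g(\mathcal{P}f)(x), \qquad \forall x, g \in G,\ f : G \to \mathbb{R}.$$

Proof: Making the substitution $u = g^{-1} h$ and using the definition of $\mathbb{T}_g$ we
have

$$\mathcal{P}(\mathbb{T}_g f)(x) = \max_{h \in xU} (\mathbb{T}_g f)(h) = \max_{h \in xU} f(g^{-1} h) = \max_{u \in g^{-1} x U} f(u) = \mathcal{P}f(g^{-1} x) = \mathbb{T}_g(\mathcal{P}f)(x).$$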
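Proposition 16.7.14 can be illustrated on the cyclic group of integers modulo $n$, written additively, where $xU$ becomes $x + U$. This is a sketch under those assumptions; the group, the pooling domain `U`, and the sample feature map are hypothetical choices.

```python
n = 8                      # work in the cyclic group Z_8 (assumption)
U = [0, 1, 2]              # pooling domain containing the neutral element 0

def pool(f):
    """Max pooling: (P f)(x) = max of f over x + U, indices modulo n."""
    return [max(f[(x + u) % n] for u in U) for x in range(n)]

def shift(f, g):
    """Extended translation: (T_g f)(x) = f(x - g), indices modulo n."""
    return [f[(x - g) % n] for x in range(n)]

f = [3, 1, 4, 1, 5, 9, 2, 6]   # arbitrary feature map on Z_8

# Pooling a translated map equals translating the pooled map, for every g.
commutes = all(pool(shift(f, g)) == shift(pool(f), g) for g in range(n))
```

For this feature map the pooled values are `[4, 4, 5, 9, 9, 9, 6, 6]`, and translating the input simply rotates this list.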


Remark 16.7.15 We have considered for the sake of simplicity only activations
and filters that have only one channel. The results can be extended to the
multiple-channel case by considering $f, w : G \to \mathbb{R}^K$, where $K$ is the number
of channels. The convolution can be defined in the discrete case as

$$(f * w)(g) = \sum_{y \in G} \sum_{k=1}^{K} f_k(y)\, w_k(g^{-1} y) = \sum_{y \in G} \langle f(y), w(g^{-1} y) \rangle$$

and in the continuous case by

$$(f * w)(g) = \int_G \sum_{k=1}^{K} f_k(y)\, w_k(g^{-1} y)\, d\mu(y) = \int_G \langle f(y), w(g^{-1} y) \rangle\, d\mu(y),$$

where $\mu$ is the Haar measure on the compact group $G$.


16.8 Summary 

CNNs are neural networks specialized in processing image data. They have
been extremely successful in practical applications such as handwritten digit
recognition. A CNN contains convolution layers, which use convolution
with a kernel instead of a regular matrix multiplication. The kernel is usually
a matrix of relatively small dimension compared with the input image
dimensions.

In a CNN, layers have sparse interactions and share weights and biases. A
local receptive field matching the dimensions of the kernel slides one (or more)
pixel(s) at a time, first horizontally and then vertically, from top to bottom,
until the entire input data is completely scanned. The sum of the products
between the kernel entries and the activations of the receptive field produces
a number, which is stored as an entry in a feature map. The convolution
between the input data and each kernel produces a different feature map,
which retains certain features of the data, such as corners, edges, or other
simple shapes.

A certain number of kernels are used to produce feature maps. In the
case of a multiple-layer CNN, the later layers consist of features on features.
The convolution layers alternate with pooling layers. At the end,
the CNN contains one (or more) fully-connected layers and then a softmax
activation for the output layer.

Convolutional neural networks are able to detect local patterns in an
image regardless of their position, because CNNs are equivariant to translations.
This means that if the input image is translated by some vector, then the
activation pattern in each higher layer of the network is also translated by
the same vector. The equivariance property can be defined for any group of
transformations acting on the layers of a neural network. The theory is developed
for both discrete groups and continuous compact groups.





16.9 Exercises 

Exercise 16.9.1 Compute the following matrix convolution: 



Exercise 16.9.2 (a) Formulate and prove a variant of Proposition 16.3.3 for
the case of one-dimensional inputs.

(b) Does the equivariance to translation still hold true if the input is a tensor?

Exercise 16.9.3 (Sobel operators) (a) Show that the convolution of an
image with the $3 \times 3$ kernel

$$K = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}$$

emphasizes horizontal edge detection.

(b) Show that the convolution with the transpose kernel, $K^T$, filters the
vertical edges.



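Exercise 16.9.4 Show that the convolution of an image with each of the following
$3 \times 3$ kernels blurs the image:

(a) (Box blur)

$$K = \begin{pmatrix} 1/9 & 1/9 & 1/9 \\ 1/9 & 1/9 & 1/9 \\ 1/9 & 1/9 & 1/9 \end{pmatrix}$$

(b) (Gaussian blur)

$$K = \begin{pmatrix} 1/16 & 1/8 & 1/16 \\ 1/8 & 1/4 & 1/8 \\ 1/16 & 1/8 & 1/16 \end{pmatrix}$$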

Exercise 16.9.5 Provide arguments supporting the fact that the convolution 
of an image with the 3x3 kernel 



sharpens the image. 




Exercise 16.9.6 Consider the following $3 \times 3$ convolution kernels

$$K_1 = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{pmatrix}, \qquad K_2 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}.$$

Give arguments supporting the claim that the effect of convolution with these kernels
is edge detection. Note that the sum of the entries in each kernel is 0. What
does this mean?

Exercise 16.9.7 Consider the following convolution image processing: each
pixel value in the original image is subtracted from its neighboring pixel value
on the left. Write a $3 \times 3$ kernel which does this operation.

Exercise 16.9.8 Let $y$ be a two-dimensional discrete signal. If $\mathcal{C}$ and $\mathcal{P}$
denote, respectively, the convolution and pooling operators, show the following
relations:

(a) $\mathcal{P} \circ \mathcal{C}(T_{a,b} \circ y) = \mathcal{P} \circ \mathcal{C}(y)$.

(b) $\mathcal{C} \circ \mathcal{P}(T_{a,b} \circ y) = \mathcal{C} \circ \mathcal{P}(y)$.

Exercise 16.9.9 If one would like to regularize the CNN given in Fig. 16.7,
in which layer should the dropout technique be used?

Exercise 16.9.10 Explain why, between two networks with the same input
and the same depth and width, a CNN is less prone to overfitting than a
fully-connected neural network.

Exercise 16.9.11 A convolution layer compresses the information, in the
sense that the output size is smaller than the input size.

(i) In the case of a one-dimensional convolution layer, if $N$ denotes the size
of the input, using a filter of size $F$ that is moved with a stride $S$ we obtain
an output of size $O$, given by

$$O = \frac{N - F}{S} + 1.$$

If padding is used, denoted by $P$, then the formula becomes

$$O = \frac{N - F + 2P}{S} + 1.$$

(ii) In the case of a two-dimensional convolution layer, if the input dimension
is $W_1 \times H_1$ and $F$ is the size of a square filter, then the output has dimension
$W_2 \times H_2$, with

$$W_2 = \frac{W_1 - F + 2P}{S} + 1, \qquad H_2 = \frac{H_1 - F + 2P}{S} + 1.$$


Exercise 16.9.12 Let $H$ be a subgroup of the group $G$ and $x \in G$. Denote by
$xH = \{xh ;\ h \in H\}$. Show that the subsets $xH$ have the following properties:

(i) $x \in xH$;

(ii) If $x, y \in G$, then either $xH = yH$, or $xH \cap yH = \emptyset$;

(iii) If $G$ is a finite group, then $H$ and $xH$ have the same number of elements;

(iv) Consider the relation $\sim$ on $G$ defined by $x \sim y \iff y^{-1} x \in H$.
Show that $\sim$ satisfies:

(a) $x \sim x$;

(b) $x \sim y \Rightarrow y \sim x$;

(c) $x \sim y$ and $y \sim z \Rightarrow x \sim z$.





Chapter 17 

Recurrent Neural Networks 


A feedforward fully-connected neural network cannot be used successfully for
modeling sequences of data. A few basic reasons are the following: such networks
cannot handle variable-length input sequences, do not share parameters, cannot
track long-term dependencies, and cannot maintain information about the
order of input data. Hence the need for a neural architecture that can handle
all the previous requirements successfully. This is the recurrent neural network,
or RNN (Rumelhart et al. [108]), which is the subject of this chapter.


© Springer Nature Switzerland AG 2020
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3_17

17.1 States Systems

States systems are easier to understand than RNNs and will constitute a
basis for later discussions on RNNs. This is the reason why we shall start by
considering a dynamical system, whose state $h_t$ at time $t$ updates as

$$h_t = f(h_{t-1}; \theta), \qquad t \ge 1, \qquad (17.1.1)$$

where the transition function $f : \mathbb{R}^k \times \mathbb{R}^m \to \mathbb{R}^k$ is a Borel-measurable
function and $\theta \in \mathbb{R}^m$ represents a vector parameter, which is independent
of $t$, see Fig. 17.1 a. We assume the states $h_t$ are random variables (discrete
or continuous) and denote by $\mathcal{H}_t = \mathfrak{S}(h_t)$ the sigma-algebra generated by $h_t$
(see the definition in section D.5 of the Appendix). Each state $h_t$ contains
some information, $\mathcal{H}_t$, about the dynamical system. Since the state $h_t$ is
determined by $h_{t-1}$, Proposition D.5.1 implies $\mathcal{H}_t \subset \mathcal{H}_{t-1}$, i.e., we obtain a
descending sequence of sigma-algebras. Namely, each state contains less or
the same information as the previous state does.

Figure 17.1: a. A dynamical system isolated from the exterior, given by equation
(17.1.1). b. A dynamical system driven by the process $X_t$, as given by equation
(17.1.2).

If $f(\cdot\,; \theta)$ is invertible in the first argument, then $\mathcal{H}_t = \mathcal{H}_{t-1}$, namely, the
information does not shrink. However, this condition is too restrictive, and in
general the information decreases strictly as $t$ increases. In the following we
shall consider an example involving an application of the contraction
principle, Theorem 7.7.1.


Example 17.1.1 Let's assume that $f$ is differentiable and satisfies the inequality
$\|\partial f / \partial h\| < \lambda < 1$. Then by the Mean Value Theorem the function $f$
becomes a $\lambda$-contraction in the first argument:

$$\|f(h; \theta) - f(h'; \theta)\| \le \max_h \|\partial_h f\|\, \|h - h'\| \le \lambda \|h - h'\|, \qquad \forall h, h' \in \mathbb{R}^k.$$

For any two sample points $\omega, \omega' \in \Omega$, we shall evaluate
the distance between their corresponding states at time $t$ by unfolding the
recurrence as

$$\|h_t(\omega) - h_t(\omega')\| = \|f(h_{t-1}(\omega); \theta) - f(h_{t-1}(\omega'); \theta)\| \le \lambda \|h_{t-1}(\omega) - h_{t-1}(\omega')\|$$

$$\le \lambda^2 \|h_{t-2}(\omega) - h_{t-2}(\omega')\| \le \cdots \le \lambda^t \|h_0(\omega) - h_0(\omega')\|.$$

Since the random variable $h_0$ is finite almost everywhere, $\|h_0(\omega) - h_0(\omega')\|$
is finite for almost all $\omega$ and $\omega'$. Taking $t \to \infty$ and using the Squeeze
Theorem, we obtain

$$\lim_{t \to \infty} \|h_t(\omega) - h_t(\omega')\| = 0$$

for almost all $\omega$ and $\omega'$. This implies that $\lim_{t \to \infty} h_t = c$, a constant,
almost surely, i.e., $P(\omega ;\ h_t(\omega) \to c) = 1$.
In fact, $c$ is the fixed point of the contraction function $f$, i.e.,

$$f(c; \theta) = c.$$

It is obvious that $c$ depends on the parameter $\theta$. In this example the sequence
of sigma-algebras $\mathcal{H}_t$ is strictly descending, with its inferior limit
$\mathcal{H}_\infty = \bigcap_{t \ge 1} \mathcal{H}_t = \mathfrak{S}(c) = \{\emptyset, \Omega\}$, which is the trivial sigma-algebra. Hence, in the
long run, the system loses all its information. Forgetting information about
the past is a characteristic of dynamical systems with a contractive transition
function. This will constitute the cause of the vanishing gradient problem for
recurrent neural networks, which will be approached later in this chapter.
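The loss of information under a contractive transition can be seen in a small numerical experiment. The sketch below assumes a hypothetical one-dimensional transition $f(h; \theta) = a \tanh(h) + b$ with $\theta = (a, b) = (0.5, 0.3)$, for which $|\partial f / \partial h| \le 0.5 < 1$; this particular $f$ is an illustrative choice, not one used in the text.

```python
import math

def f(h, theta=(0.5, 0.3)):
    """Hypothetical contractive transition: |df/dh| <= a = 0.5 < 1."""
    a, b = theta
    return a * math.tanh(h) + b

def iterate(h0, steps):
    """Unfold the recurrence h_t = f(h_{t-1}; theta) starting from h0."""
    h = h0
    for _ in range(steps):
        h = f(h)
    return h

# Two very different initial states end up at the same limit c.
c1 = iterate(-5.0, 60)
c2 = iterate(5.0, 60)

# After t steps the gap is bounded by lambda^t times the initial gap.
gap_10 = abs(iterate(-5.0, 10) - iterate(5.0, 10))
bound_10 = 0.5 ** 10 * 10.0
```

After enough steps the state no longer depends on where the system started, which is the numerical counterpart of $\mathcal{H}_\infty$ being trivial.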

Since in general the information carried by $\mathcal{H}_t$ decreases as $t$ increases, we
shall consider additional information inserted into the system at each time
instance. Thus, we consider a dynamical system driven by an external signal
modeled by the stochastic process $X_t$, $t \ge 1$. The recurrence in this case can
be written as

$$h_t = f(h_{t-1}, X_t; \theta), \qquad t \ge 1, \qquad (17.1.2)$$

see Fig. 17.1 b. Since $h_t$ is determined now by both $h_{t-1}$ and $X_t$, we obtain
$\mathcal{H}_t \subset \mathfrak{S}(h_{t-1}, X_t)$. Using the definition of the joint sigma-algebra of two
random variables

$$\mathfrak{S}(h_{t-1}, X_t) = \mathfrak{S}\big(\mathfrak{S}(h_{t-1}) \cup \mathfrak{S}(X_t)\big) = \mathfrak{S}(\mathcal{H}_{t-1} \cup \mathcal{I}_t),$$

the previous inclusion becomes

$$\mathcal{H}_t \subset \mathfrak{S}(\mathcal{H}_{t-1} \cup \mathcal{I}_t), \qquad (17.1.3)$$

where $\mathcal{I}_t$ denotes the input information generated by the input variable $X_t$.

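Proposition 17.1.2 Consider a dynamical system with a given initial state
$h_0$ and driven by a stochastic process $X_t$ as in (17.1.2). Then

$$\mathcal{H}_t \subset \mathfrak{S}(\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_t). \qquad (17.1.4)$$

Proof: The proof follows from a repeated application of formula (17.1.3)
and Exercise 17.10.1, which implies

$$\mathcal{H}_t \subset \mathfrak{S}(\mathcal{H}_{t-1} \cup \mathcal{I}_t) \subset \mathfrak{S}\big(\mathfrak{S}(\mathcal{H}_{t-2} \cup \mathcal{I}_{t-1}) \cup \mathcal{I}_t\big) = \mathfrak{S}(\mathcal{H}_{t-2} \cup \mathcal{I}_{t-1} \cup \mathcal{I}_t).$$

Inductively, we obtain

$$\mathcal{H}_t \subset \mathfrak{S}(\mathcal{H}_0 \cup \mathcal{I}_1 \cup \cdots \cup \mathcal{I}_t).$$

Since $h_0$ is a given constant, $\mathcal{H}_0 = \{\emptyset, \Omega\}$, and hence

$$\mathcal{H}_0 \cup \mathcal{I}_1 \cup \cdots \cup \mathcal{I}_t = \mathcal{I}_1 \cup \cdots \cup \mathcal{I}_t.$$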


Figure 17.2: A "many-to-many" configuration of an RNN.

Remark 17.1.3 It is worth noting the use of dependent variables. Unfolding,
we have

$$h_t = f(h_{t-1}, X_t) = f\big(f(h_{t-2}, X_{t-1}), X_t\big) = g(h_{t-2}, X_{t-1}, X_t),$$

with $g$ measurable. Since $h_t$ is determined by $h_{t-2}$, $X_{t-1}$, and $X_t$, by
Proposition D.5.1 we obtain $\mathfrak{S}(h_t) \subset \mathfrak{S}(h_{t-2}, X_{t-1}, X_t)$.


17.2 RNNs 

A recurrent neural network can be introduced in a couple of ways.

1. First, it can be considered as a state system having the property that
each state, except the initial one, provides an outcome, see Fig. 17.2. Thus, the
state $h_t$ is fed information from the previous state, $h_{t-1}$, and from the present
input, $X_t$, and provides an outcome $Y_t$ and an input to the next state, $h_{t+1}$.
The state $h_t$ is called the $t$th hidden state of the RNN. $X_t$ and $Y_t$ are the $t$th
input and output, respectively.

2. We can also think of an RNN as a sequence of plain vanilla feedforward
neural networks, $X_t \to h_t \to Y_t$ (with input layer $X_t$, hidden layer $h_t$ and
output layer $Y_t$), which feed information from one hidden layer to the next.

The standard equations for the forward propagation in an RNN are given
by

$$h_t = \tanh(W h_{t-1} + U X_t + b) \qquad (17.2.5)$$

$$Y_t = V h_t + c. \qquad (17.2.6)$$

Thus, the hidden states take values between $-1$ and $1$ and the output is
an affine function of the current hidden state. The transition function $f$ is
a composition between the hyperbolic tangent and an affine function. The
parameter $\theta = (W, U, b, c)$ consists of two matrices and two vectors. The matrices
$W$ and $U$ represent the hidden-to-hidden state transition and the input-to-hidden
state transition, respectively. The matrix $V$ represents the transition from hidden
state to output, while $b$ and $c$ are bias vectors.
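Equations (17.2.5)-(17.2.6) translate directly into a loop over time steps. The following is a minimal sketch using plain lists; the helper names `matvec` and `rnn_forward`, the dimensions, and the numerical values of the parameters are illustrative assumptions.

```python
import math

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def rnn_forward(xs, h0, W, U, V, b, c):
    """Forward pass: h_t = tanh(W h_{t-1} + U X_t + b), Y_t = V h_t + c."""
    h, hs, ys = h0, [], []
    for x in xs:
        a = [p + q + r for p, q, r in zip(matvec(W, h), matvec(U, x), b)]
        h = [math.tanh(ai) for ai in a]
        y = [p + q for p, q in zip(matvec(V, h), c)]
        hs.append(h)
        ys.append(y)
    return hs, ys

# Hypothetical parameters: hidden size 2, input size 1, output size 1.
W = [[0.5, -0.1], [0.2, 0.3]]
U = [[1.0], [-1.0]]
V = [[1.0, 1.0]]
b = [0.0, 0.0]
c = [0.5]

xs = [[1.0], [0.0], [-1.0]]          # input sequence X_1, X_2, X_3
hs, ys = rnn_forward(xs, [0.0, 0.0], W, U, V, b, c)
```

The same parameters $W$, $U$, $V$, $b$, $c$ are reused at every time step, which is the parameter sharing mentioned at the beginning of the chapter.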

17.3 Information in RNNs 

We introduce some terminology first. The input information generated by
$X_t$ is denoted by $\mathcal{I}_t$. The hidden information is generated by the hidden
state, $\mathcal{H}_t = \mathfrak{S}(h_t)$, and the output information is generated by the output,
$\mathcal{E}_t = \mathfrak{S}(Y_t)$. This section deals with the relation between these sigma-algebras.

The first result shows how the output information relates to the input
and hidden information.

Proposition 17.3.1 Consider an RNN with a given initial state $h_0$ and
inputs $X_t$ satisfying the forward pass (17.2.5)-(17.2.6). Then

$$\mathcal{E}_t \subset \mathcal{H}_t \subset \mathfrak{S}(\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_t). \qquad (17.3.7)$$

Proof: From relation (17.2.6), the variable $Y_t$ is determined by $h_t$, so by
Proposition D.5.1 we have $\mathfrak{S}(Y_t) \subset \mathfrak{S}(h_t)$, which shows the first inclusion of
(17.3.7). The second inclusion follows from Proposition 17.1.2.

■ 

It is natural to consider the case when the information $\mathcal{I}_t$ inserted into
the system at each time step is "new", namely the case when the input
variables $X_t$, $t \in \{1, 2, \dots\}$, are independent. We obtain the following history
independence property.

Proposition 17.3.2 Consider an RNN with a given initial state $h_0$ and
independent inputs $X_t$ satisfying the forward pass (17.2.5)-(17.2.6). Then both
sigma-algebras $\mathcal{E}_{t-1}$ and $\mathcal{H}_{t-1}$ are independent of $\mathcal{I}_t$ (namely, $Y_{t-1}$ and
$h_{t-1}$ are independent of $X_t$).

Proof: Since $X_t$ is independent of $X_1, \dots, X_{t-1}$, $\mathcal{I}_t$ is a body of
information independent of $\{\mathcal{I}_1, \dots, \mathcal{I}_{t-1}\}$. Then their generated sigma-algebra,
$\mathfrak{S}(\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_{t-1})$, is independent of $\mathcal{I}_t$, too. Since by Proposition 17.3.1 we
have $\mathcal{H}_{t-1} \subset \mathfrak{S}(\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_{t-1})$, it follows that $\mathcal{H}_{t-1}$ is independent of $\mathcal{I}_t$.
The fact that $\mathcal{E}_{t-1}$ is independent of $\mathcal{I}_t$ follows from the inclusion $\mathcal{E}_{t-1} \subset \mathcal{H}_{t-1}$,
see Proposition 17.3.1. ■

In order to assess the information using entropy, we shall assume that the
variables $h_t$, $X_t$, and $Y_t$ have the same vectorial dimension, so $U$, $V$, and $W$
become square matrices. We shall further assume that they are also nonsingular.
In this case we can solve for $h_t$ from (17.2.6) as $h_t = V^{-1}(Y_t - c)$. This
implies via Proposition D.5.1 that $\mathcal{H}_t \subset \mathcal{E}_t$. Since the inverse inclusion has
been shown in Proposition 17.3.1, it follows that $\mathcal{H}_t = \mathcal{E}_t$.


Proposition 17.3.3 Assume that the variables $h_t$, $X_t$, and $Y_t$ have the same
vectorial dimension and $(X_t)_{t \ge 1}$ are independent random variables.

(a) The conditional entropy of one hidden state with respect to the previous
state satisfies

$$H(h_t \mid h_{t-1}) < H(X_t) + \ln |\det U|.$$

(b) The conditional entropy of an output with respect to the previous hidden
state satisfies

$$H(Y_t \mid h_{t-1}) < H(X_t) + \ln |\det(UV)|.$$

Proof: (a) Let $a_t = W h_{t-1} + U X_t + b$, so $h_t = \tanh a_t$. We evaluate the
conditional entropy using Proposition 12.1.2:

$$H(h_t \mid h_{t-1}) = H(\tanh(a_t) \mid h_{t-1}) = H(a_t \mid h_{t-1}) - \mathbb{E}^{P_{a_t \mid h_{t-1}}}\big[\ln |\det J_{\tanh^{-1}}(h_t)|\big]. \qquad (17.3.8)$$

We shall compute next the Jacobian. Since $a_t = \tanh^{-1}(h_t)$, and $\tanh^{-1}$ acts
on components, then $a_t^j = \tanh^{-1}(h_t^j)$, so

$$J_{\tanh^{-1}}(h_t) = \Big(\frac{\partial a_t^j}{\partial h_t^k}\Big)_{j,k} = \Big(\frac{\partial \tanh^{-1}(h_t^j)}{\partial h_t^k}\Big)_{j,k} = \Big(\frac{\delta_{jk}}{1 - (h_t^j)^2}\Big)_{j,k}.$$

Since the Jacobian is diagonal, it follows that

$$\det J_{\tanh^{-1}}(h_t) = \prod_j \frac{1}{1 - (h_t^j)^2}.$$

Substituting back into (17.3.8) yields

$$H(h_t \mid h_{t-1}) = H(a_t \mid h_{t-1}) + \mathbb{E}^{P_{a_t \mid h_{t-1}}}\Big[\ln \prod_j |1 - (h_t^j)^2|\Big] < H(a_t \mid h_{t-1}), \qquad (17.3.9)$$

since $1 - (h_t^j)^2 \in (0, 1)$. We evaluate next the conditional entropy $H(a_t \mid h_{t-1})$
as

$$H(a_t \mid h_{t-1}) = H(W h_{t-1} + U X_t + b \mid h_{t-1}) = H(U X_t \mid h_{t-1}) = H(U X_t) = H(X_t) + \ln |\det U|.$$

We have taken into account that $X_t$ and $h_{t-1}$ are independent and then
used Proposition 12.1.2. Substituting back into (17.3.9) yields the desired
inequality of part (a).

(b) Using $Y_t = V h_t + c$, a similar computation provides

$$H(Y_t \mid h_{t-1}) = H(h_t \mid h_{t-1}) + \ln |\det V| < H(X_t) + \ln |\det U| + \ln |\det V|,$$

where we have used the inequality from part (a).
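The step $H(U X_t) = H(X_t) + \ln |\det U|$ used in the proof can be checked in closed form when $X_t$ is Gaussian, since the differential entropy of $N(0, \Sigma)$ in dimension $n$ is $\frac{1}{2} \ln\big((2 \pi e)^n \det \Sigma\big)$ and $U X_t \sim N(0, U \Sigma U^T)$. The sketch below does this in dimension 2; the Gaussian assumption and the particular matrices are illustrative and not taken from the text.

```python
import math

def det2(A):
    """Determinant of a 2 x 2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def matmul2(A, B):
    """Product of two 2 x 2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def gaussian_entropy(S):
    """Differential entropy of N(0, S) in dimension 2."""
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det2(S))

Sigma = [[2.0, 0.3], [0.3, 1.0]]     # hypothetical covariance of X_t
U = [[0.5, 0.1], [-0.2, 0.4]]        # hypothetical nonsingular matrix, |det U| < 1

cov_UX = matmul2(matmul2(U, Sigma), transpose2(U))   # covariance of U X_t
lhs = gaussian_entropy(cov_UX)
rhs = gaussian_entropy(Sigma) + math.log(abs(det2(U)))
```

Here $|\det U| = 0.22 < 1$, so the correction term $\ln |\det U|$ is negative and the linear map lowers the entropy.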


Remark 17.3.4 If the entries of the matrices $U$ and $V$ are small enough, the
absolute values of their determinants are smaller than 1. Under this regularization
condition we obtain the following upper bounds:

$$H(h_t \mid h_{t-1}) < H(X_t), \qquad H(Y_t \mid h_{t-1}) < H(X_t).$$

This asserts that the entropy of the driving signal, $X_t$, is larger than the
entropies of the output and hidden state, conditioned on the previous
hidden state.

We also note that by virtue of Corollary 3.5.2 the conditional entropies
$H(h_t \mid h_{t-1})$ are controlled by the input variance, $Var(X_t)$, in the sense that
a small variance of $X_t$ implies a small value for the conditional entropy
$H(h_t \mid h_{t-1})$.


17.4 Loss Functions 


Consider an RNN with $T$ cells. We shall denote the inputs and outputs
by $(X_1, \dots, X_T)$ and $(Y_1, \dots, Y_T)$, respectively. The target in the case of a
“many-to-many” RNN, which is represented in Fig. 17.2, is given by the
sequence $(Z_1, \dots, Z_T)$. The loss function, $L$, for an RNN represents the
“distance” between the sequences $(Y_1, \dots, Y_T)$ and $(Z_1, \dots, Z_T)$. This can be
considered as the sum

$$L = \sum_{t=1}^{T} L_t,$$

where the individual loss function, $L_t$, measures the proximity of $Y_t$ with
respect to $Z_t$.

For the individual loss function $L_t$ we have a few choices. In the case of
random variables, the individual loss can be the mean square error

$$L_t = \mathbb{E}\bigl[|Y_t - Z_t|^2\bigr],$$

the Kullback-Leibler divergence

$$L_t = D_{KL}\bigl(p_{(X_1,\dots,X_t),Z_t} \,\|\, p_{\theta;(X_1,\dots,X_t),Z_t}\bigr),$$

or the cross-entropy

$$L_t = S\bigl(p_{(X_1,\dots,X_t),Z_t},\, p_{\theta;(X_1,\dots,X_t),Z_t}\bigr) = -\mathbb{E}^{p_{(X_1,\dots,X_t),Z_t}}\bigl[\log p_{\theta;(X_1,\dots,X_t),Z_t}\bigr],$$

where $\theta = (U, V, W, b, c)$ is the parameter of the model.

In the case when the variables are continuous we may choose the individual
loss as half the squared Euclidean distance

$$L_t = \frac{1}{2}(Y_t - Z_t)^2.$$

Regardless of the loss function considered, computing the gradient with respect
to the parameters, $\nabla_\theta L$, is an expensive operation and leads in many cases to
gradient problems, as we shall see later.


17.5 Backpropagation Through Time 

The forward pass through an RNN is given by formulas (17.2.5)-(17.2.6),
which provide values for $h_t$, $Y_t$, and the losses $L_t$. Now, in order to minimize
the loss function, it suffices to apply the gradient descent method, which
requires the computation of the gradient $\nabla_\theta L$. The method of computing the
gradient is called backpropagation through time, and it is a variant of the
backpropagation method studied in earlier chapters. However, this is more
complex now, since it is a composition between a backpropagation at each
individual time step and a backpropagation across time.

Since the exposition of the general case contains complicated notations,
which might be potentially confusing, we shall exemplify the method of
backpropagation through time in the case of an RNN with only two steps, see
Fig. 17.3. In this case there are two inputs $X_1, X_2$, two outputs $Y_1, Y_2$, as
well as two hidden states, $h_1, h_2$. For the sake of simplicity we shall consider
them one-dimensional continuous variables. The forward pass equations can
be written as


$$a_1 = W h_0 + U X_1 + b, \qquad h_1 = \tanh a_1, \qquad Y_1 = V h_1 + c,$$

$$a_2 = W h_1 + U X_2 + b, \qquad h_2 = \tanh a_2, \qquad Y_2 = V h_2 + c.$$


These formulas are used to compute the loss function. For simplicity, we
consider $L_t = \frac{1}{2}(Y_t - Z_t)^2$, so the loss function becomes

$$L = L_1 + L_2 = \frac{1}{2}(Y_1 - Z_1)^2 + \frac{1}{2}(Y_2 - Z_2)^2 = \frac{1}{2}\|Y - Z\|^2_{Eu}.$$




Figure 17.3: An RNN configuration with two steps. 
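
The forward pass of the two-step RNN can be sketched in a few lines of Python. This is only an illustration under stated assumptions: all variables are scalars, as in the text, and the function name `forward_two_steps` and the numerical values of the parameters are hypothetical choices that do not come from the book.

```python
import math

def forward_two_steps(theta, h0, X, Z):
    """Forward pass of the two-step scalar RNN.

    theta = (W, V, U, b, c); X = (X1, X2) inputs; Z = (Z1, Z2) targets.
    Returns the hidden states, the outputs and the loss L = L1 + L2.
    """
    W, V, U, b, c = theta
    a1 = W * h0 + U * X[0] + b
    h1 = math.tanh(a1)
    Y1 = V * h1 + c
    a2 = W * h1 + U * X[1] + b      # the same W, U, b are reused (parameter sharing)
    h2 = math.tanh(a2)
    Y2 = V * h2 + c
    L = 0.5 * (Y1 - Z[0]) ** 2 + 0.5 * (Y2 - Z[1]) ** 2
    return (h1, h2), (Y1, Y2), L

# Hypothetical parameter values, chosen only for illustration.
theta = (0.5, 1.5, -0.3, 0.1, 0.2)
hidden, outputs, loss = forward_two_steps(theta, h0=0.0, X=(1.0, -2.0), Z=(0.5, 0.0))
print(hidden, outputs, loss)
```

Note that the hidden states always stay in $(-1, 1)$, which is the property used later in the discussion of the vanishing gradient.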


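We need to compute five gradients

$$\nabla_\theta L = \Bigl(\frac{\partial L}{\partial W}, \frac{\partial L}{\partial V}, \frac{\partial L}{\partial U}, \frac{\partial L}{\partial b}, \frac{\partial L}{\partial c}\Bigr).$$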


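We shall compute beforehand a few derivatives that will be useful shortly.
Using the chain rule, we have

$$\frac{\partial h_1}{\partial W} = \frac{\partial}{\partial W}\tanh(a_1) = \operatorname{sech}^2(a_1)\,\frac{\partial a_1}{\partial W} = (1 - \tanh^2 a_1)\,h_0 = (1 - h_1^2)\,h_0.$$

A similar computation, in which $h_1$ is held fixed when differentiating $h_2$, can be used to obtain

$$\frac{\partial h_2}{\partial W} = (1 - h_2^2)\,h_1, \qquad \frac{\partial h_1}{\partial b} = 1 - h_1^2, \qquad \frac{\partial h_2}{\partial b} = 1 - h_2^2.$$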


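Since $L_1$ depends on $h_1$ only through $Y_1$, the chain rule yields

$$\frac{\partial L_1}{\partial h_1} = \frac{\partial L_1}{\partial Y_1}\frac{\partial Y_1}{\partial h_1} = (Y_1 - Z_1)V.$$

Similarly,

$$\frac{\partial L_2}{\partial h_2} = \frac{\partial L_2}{\partial Y_2}\frac{\partial Y_2}{\partial h_2} = (Y_2 - Z_2)V.$$

We also have

$$\frac{\partial h_2}{\partial h_1} = \frac{\partial}{\partial h_1}\tanh a_2 = \operatorname{sech}^2(a_2)\,\frac{\partial a_2}{\partial h_1} = (1 - \tanh^2 a_2)W = (1 - h_2^2)W.$$

Another application of the chain rule provides

$$\frac{\partial h_2}{\partial U} = \frac{\partial}{\partial U}\tanh a_2 = (1 - h_2^2)\frac{\partial a_2}{\partial U} = (1 - h_2^2)X_2.$$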

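Now we are well prepared to compute the gradients of the loss function.
We shall start with the gradient with respect to $V$. Since the loss $L_t$ depends
on $V$ through $Y_t$, we have

$$\frac{\partial L}{\partial V} = \frac{\partial L_1}{\partial V} + \frac{\partial L_2}{\partial V} = \frac{\partial L_1}{\partial Y_1}\frac{\partial Y_1}{\partial V} + \frac{\partial L_2}{\partial Y_2}\frac{\partial Y_2}{\partial V}$$

$$= (Y_1 - Z_1)\frac{\partial}{\partial V}(V h_1 + c) + (Y_2 - Z_2)\frac{\partial}{\partial V}(V h_2 + c)$$

$$= (Y_1 - Z_1)h_1 + (Y_2 - Z_2)h_2 = \sum_t (Y_t - Z_t)h_t.$$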

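The derivative with respect to $c$ is computed similarly

$$\frac{\partial L}{\partial c} = \frac{\partial L_1}{\partial c} + \frac{\partial L_2}{\partial c} = \frac{\partial L_1}{\partial Y_1}\frac{\partial Y_1}{\partial c} + \frac{\partial L_2}{\partial Y_2}\frac{\partial Y_2}{\partial c}$$

$$= (Y_1 - Z_1)\frac{\partial}{\partial c}(V h_1 + c) + (Y_2 - Z_2)\frac{\partial}{\partial c}(V h_2 + c)$$

$$= (Y_1 - Z_1) + (Y_2 - Z_2) = \sum_t (Y_t - Z_t).$$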

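When computing the gradient with respect to $W$ we take into account that
$L_1$ depends on $W$ only through $h_1$, while $L_2$ depends on $W$ through both $h_1$
and $h_2$ (there are two edges containing $W$). An application of the chain rule
provides

$$\frac{\partial L}{\partial W} = \frac{\partial L_1}{\partial W} + \frac{\partial L_2}{\partial W} = \frac{\partial L_1}{\partial h_1}\frac{\partial h_1}{\partial W} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial W} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W}$$

$$= (Y_1 - Z_1)V(1 - h_1^2)h_0 + (Y_2 - Z_2)V(1 - h_2^2)h_1 + (Y_2 - Z_2)V(1 - h_2^2)W(1 - h_1^2)h_0.$$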


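Since there are two vertical edges involving $U$, a similar computation can
be applied to obtain

$$\frac{\partial L}{\partial U} = \frac{\partial L_1}{\partial U} + \frac{\partial L_2}{\partial U} = \frac{\partial L_1}{\partial h_1}\frac{\partial h_1}{\partial U} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial U} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial U}$$

$$= (Y_1 - Z_1)V(1 - h_1^2)X_1 + (Y_2 - Z_2)V(1 - h_2^2)X_2 + (Y_2 - Z_2)V(1 - h_2^2)W(1 - h_1^2)X_1.$$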



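The last gradient is computed with respect to the bias $b$

$$\frac{\partial L}{\partial b} = \frac{\partial L_1}{\partial b} + \frac{\partial L_2}{\partial b} = \frac{\partial L_1}{\partial h_1}\frac{\partial h_1}{\partial b} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial b} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial b}$$

$$= (Y_1 - Z_1)V(1 - h_1^2) + (Y_2 - Z_2)V(1 - h_2^2) + (Y_2 - Z_2)V(1 - h_2^2)W(1 - h_1^2).$$

Similar formulas can be obtained for an RNN with more than two hidden
states.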
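
The five gradient formulas of the two-step RNN can be checked numerically. The sketch below, written for scalar variables as in the text, implements the closed-form expressions and compares them with central finite differences of the loss. The function names, the step `eps`, and the parameter values are hypothetical choices made only for this illustration.

```python
import math

def forward(theta, h0, X):
    """Forward pass of the two-step scalar RNN; theta = (W, V, U, b, c)."""
    W, V, U, b, c = theta
    h1 = math.tanh(W * h0 + U * X[0] + b)
    h2 = math.tanh(W * h1 + U * X[1] + b)
    return h1, h2, V * h1 + c, V * h2 + c

def loss(theta, h0, X, Z):
    _, _, Y1, Y2 = forward(theta, h0, X)
    return 0.5 * (Y1 - Z[0]) ** 2 + 0.5 * (Y2 - Z[1]) ** 2

def bptt_gradients(theta, h0, X, Z):
    """Closed-form gradients (dL/dW, dL/dV, dL/dU, dL/db, dL/dc)."""
    W, V, U, b, c = theta
    h1, h2, Y1, Y2 = forward(theta, h0, X)
    e1, e2 = Y1 - Z[0], Y2 - Z[1]          # output errors Y_t - Z_t
    d1, d2 = 1 - h1 ** 2, 1 - h2 ** 2      # tanh derivatives 1 - h_t^2
    dW = e1 * V * d1 * h0 + e2 * V * d2 * h1 + e2 * V * d2 * W * d1 * h0
    dV = e1 * h1 + e2 * h2
    dU = e1 * V * d1 * X[0] + e2 * V * d2 * X[1] + e2 * V * d2 * W * d1 * X[0]
    db = e1 * V * d1 + e2 * V * d2 + e2 * V * d2 * W * d1
    dc = e1 + e2
    return (dW, dV, dU, db, dc)

def numerical_gradients(theta, h0, X, Z, eps=1e-6):
    """Central finite differences of the loss, one parameter at a time."""
    grads = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grads.append((loss(up, h0, X, Z) - loss(down, h0, X, Z)) / (2 * eps))
    return tuple(grads)

# Hypothetical values, chosen only for illustration.
theta = (0.5, 1.5, -0.3, 0.1, 0.2)
h0, X, Z = 0.4, (1.0, -2.0), (0.5, 0.0)
analytic = bptt_gradients(theta, h0, X, Z)
numeric = numerical_gradients(theta, h0, X, Z)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest discrepancy:", max_error)
```

A finite-difference comparison of this kind is a common sanity check whenever backpropagation formulas are derived by hand.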


17.6 The Gradient Problems 

The following difficulties in dealing with gradients in an RNN were first
pointed out in Hochreiter [55] and Bengio et al. [14, 15].

Vanishing gradient problem We notice from the previous gradient formulas
that the gradients with respect to $W$, $U$, and $b$ involve products of the
matrices $W$, $V$ with the factors $(1 - h_1^2)$ and $(1 - h_2^2)$. Given that
$h_t = \tanh a_t \in (-1,1)$, then $1 - h_t^2 \in (0,1)$. Therefore, a product of
factors of this type has the effect of decreasing the gradient. The longer the
RNN, the more factors of the form $(1 - h_t^2)$ appear, and hence the smaller the
gradient.


































The factor involving the matrix $W$ in the gradient formula comes from
the derivative $\partial h_2/\partial h_1$. In the case of an RNN of length $T$ there are more products
involving derivatives of type $\partial h_t/\partial h_{t-1}$, which will have the effect of producing
a power $W^{T-1}$. If the matrix $W$ has eigenvalues $|\lambda_i| < 1$, then the power
$W^{T-1}$ will have eigenvalues $|\lambda_i|^{T-1} < 1$. This follows from the eigenvalue
decomposition $W = MDM^{-1}$, which implies $W^{T-1} = MD^{T-1}M^{-1}$, where $D$ is
the diagonal form having the eigenvalues along the diagonal. Since for $T$ large we
have $|\lambda_i|^{T-1} \to 0$, then $D^{T-1} \to O$, and hence $W^{T-1}$ tends to the zero matrix
when the length of the RNN increases, see Proposition G.1.2 in the Appendix.
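
The decay of the powers of $W$ can be observed numerically. The following sketch assumes a hypothetical $2 \times 2$ matrix whose eigenvalues lie inside the unit disk and prints the Frobenius norm of a few of its powers; the matrix entries and the list of exponents are illustrative choices only.

```python
import numpy as np

# Hypothetical recurrent matrix with spectral radius smaller than 1.
W = np.array([[0.5, 0.3],
              [0.1, 0.4]])
spectral_radius = max(abs(np.linalg.eigvals(W)))

powers = (1, 5, 20, 50)
norms = [np.linalg.norm(np.linalg.matrix_power(W, p)) for p in powers]

print(f"spectral radius = {spectral_radius:.3f}")
for p, nrm in zip(powers, norms):
    print(f"p = {p:2d}   ||W^p|| = {nrm:.3e}")
```

The norms shrink roughly geometrically at the rate of the spectral radius, which is the mechanism behind the vanishing factor $W^{T-1}$.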


The previous discussion can be summarized under the notion of vanishing
gradient problem. There are several remedies for this problem, which prevent
the gradient from shrinking to a greater or lesser extent.

(i) One solution to the problem is to change the activation function. Since the
factors $(1 - h_t^2)$ come from the derivative $\tanh'(a_t)$, one idea is to replace the
activation function with one whose derivative is not everywhere less than 1,
such as ReLU, whose derivative is equal to 1 on positive activations.

(ii) Another solution is to initialize the weights to the identity matrix, $W = I$.
This prevents the weights from shrinking to zero too fast, since the eigenvalues
of the powers $W^p$ will stay closer to 1 for a larger number of iterations.

(iii) The most robust fix of the vanishing gradient problem is to employ a
novel architecture involving “gated cells”, such as LSTM or GRU cells. We
shall deal with this type of architecture in the next section.


Exploding gradient problem We assume now that the matrix $W$ has one
eigenvalue satisfying $|\lambda_i| > 1$. Since $W^p = MD^pM^{-1}$, then $D^p$ has the entry $\lambda_i^p$
tending to infinity in absolute value for $p$ large. Therefore, some entries of $W^p$ will tend to
infinity, a fact that leads to the exploding gradient problem. One fix is to
initialize the weights $W$ to be the identity matrix, in the hope of preventing the
weights from exploding too soon. Another useful technique is gradient clipping,
which is based on rescaling back the gradients when they become too large,
see Mikolov [84] and Pascanu et al. [96].
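
Gradient clipping can be sketched as follows. This is one common variant, rescaling by the Euclidean norm, and is not necessarily the exact procedure of the cited references; the function name `clip_by_norm`, the threshold `max_norm`, and the matrix used to produce a large vector are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)   # same direction, smaller length
    return grad

# A matrix with an eigenvalue 1.2 > 1: its powers blow up.
W = np.array([[1.2, 0.0],
              [0.0, 0.5]])
g = np.linalg.matrix_power(W, 40) @ np.array([1.0, 1.0])
clipped = clip_by_norm(g, max_norm=5.0)

print("norm before clipping:", np.linalg.norm(g))
print("norm after clipping: ", np.linalg.norm(clipped))
```

Clipping changes only the length of the gradient, not its direction, so the descent step still points the same way but cannot become arbitrarily large.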


17.7 LSTM Cells 

The Long Short-Term Memory network, LSTM, was introduced in 1997
by Hochreiter and Schmidhuber [56]. This is a type of RNN that contains
special cells capable of learning long-term dependencies. They use the
concept of gates, which optionally let information pass through a sigmoid
layer and a pointwise multiplication. The functionality of an LSTM involves
three types of gates: forget, update, and output. It is based on introducing
an internal cell state, $C_t$, and an inner loop in each cell. The
description of each gate is as follows.

1. The forget gate is a sigmoid layer that is used to forget the irrelevant
history information. It is defined by

$$f_t = \sigma(W_f h_{t-1} + U_f X_t + b_f),$$

where $W_f, U_f$ are matrices and $b_f$ is a bias vector. Since $\sigma$ is a logistic sigmoid
function, $f_t \in (0,1)$. This represents the fraction of the past state that
will be retained, so that $1 - f_t$ is the fraction forgotten. It depends on the last
state $h_{t-1}$ as well as the present input, $X_t$.

2. The update gate selectively updates the internal cell state value, $C_t$, as

$$C_t = f_t C_{t-1} + i_t \tilde{C}_t,$$

where the products are taken componentwise and $i_t$ is a scale factor defined by the sigmoid layer

$$i_t = \sigma(W_i h_{t-1} + U_i X_t + b_i),$$

with matrices $W_i$, $U_i$, and bias $b_i$, and

$$\tilde{C}_t = \tanh(W_c h_{t-1} + U_c X_t + b_c)$$

represents a candidate that could be added to the internal state, which
belongs to the interval $(-1,1)$. The product $i_t \tilde{C}_t$ represents a fraction of
the candidate, which updates the internal state $C_t$. The term $f_t C_{t-1}$
represents how much of the previous internal state value is left after forgetting.

3. The output gate provides the value of the hidden state

$$h_t = o_t \tanh(C_t),$$

as the product between a factor that decides how much to output,

$$o_t = \sigma(W_o h_{t-1} + U_o X_t + b_o),$$

and a state between $-1$ and $1$ obtained by applying the tanh function to the
internal state, $C_t$.

The use of LSTM cells in RNNs helps with solving the vanishing gradient
problem, since LSTM cells allow gradients to flow unchanged. However,
they still suffer from the exploding gradient problem. There are other
architectures which serve similar purposes, such as peephole LSTM [42], peephole
convolution LSTM [113], GRU cells [24], etc.
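
The three gates can be combined into a single update step. The sketch below is a minimal NumPy illustration of the equations above, not an optimized or library implementation; the dictionary keys, the helper `init_params`, the dimensions, and the random initialization scale are assumptions introduced only for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state h_t and cell state C_t."""
    f_t = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])     # forget gate
    i_t = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])     # update scale
    c_cand = np.tanh(p["W_c"] @ h_prev + p["U_c"] @ x_t + p["b_c"])  # candidate
    c_t = f_t * c_prev + i_t * c_cand                                # new cell state
    o_t = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])     # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(n_hidden, n_input, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    p = {}
    for gate in "fico":
        p[f"W_{gate}"] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
        p[f"U_{gate}"] = 0.1 * rng.standard_normal((n_hidden, n_input))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    return p

rng = np.random.default_rng(0)
params = init_params(n_hidden=3, n_input=2, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.standard_normal((5, 2)):   # a sequence of 5 inputs in R^2
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Since $o_t \in (0,1)$ and $\tanh(C_t) \in (-1,1)$, every component of the hidden state stays in $(-1,1)$, while the cell state $C_t$ itself is not bounded in this way.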





Figure 17.4: A deep RNN with $T = 4$ and $L = 2$.


17.8 Deep RNNs 


We have seen that an RNN can be viewed as a sequence of one-hidden-layer
parameter-sharing feedforward neural networks, whose hidden layers exchange
information by a transition function. The first idea to introduce depth in an
RNN appeared in Graves [50] and was shortly followed by Pascanu et al. [95].

A deep RNN is obtained by unfolding a deep parameter-sharing feedforward
neural network. In this case the hidden state of the RNN, $h_t^{(\ell)}$, has two
indices. The lower index, $t$, represents the position of the state across time,
while the upper index, $\ell$, provides the layer number the state belongs to, see
Fig. 17.4. A deep RNN with dimensions $T \times L$ means an architecture with $L$
horizontal layers, $1 \le \ell \le L$, and $T$ vertical unfoldings, with $1 \le t \le T$.

Training deep RNNs is harder and more expensive than training simple
RNNs. We shall provide a supporting argument for this. Consider the deep
RNN with 2 layers and 4 recurrences given in Fig. 17.4. In the process of
finding the gradient of the loss function with respect to $W$ we need to compute,
among other derivatives, the derivative $\partial Y_4/\partial W$. Since $Y_4$ depends on
$W$ through the intermediate variables $h_t^{(\ell)}$, $1 \le t \le 4$, $1 \le \ell \le 2$, an
application of the chain rule yields products involving two types of factors:

$$\frac{\partial h_t^{(\ell)}}{\partial h_{t-1}^{(\ell)}} \quad \text{and} \quad \frac{\partial h_t^{(2)}}{\partial h_t^{(1)}}.$$

The first one introduces a multiplication by $W$, as we have seen
in the computations of Section 17.5. We look now at the second factor. If we
assume the layers are related by a logistic sigmoid function as

$$h_t^{(2)} = \sigma(W_h h_t^{(1)} + b_h),$$

then using the properties of the sigmoid we obtain

$$\frac{\partial h_t^{(2)}}{\partial h_t^{(1)}} = W_h\,\sigma'(W_h h_t^{(1)} + b_h) = W_h\,\sigma(W_h h_t^{(1)} + b_h)\bigl(1 - \sigma(W_h h_t^{(1)} + b_h)\bigr).$$

Since $\sigma' \le 1/4$, this term also induces a shrinkage effect on the derivative.
The deeper the RNN, the more prominent this effect will be. All the
aforementioned effects lead to a more pronounced vanishing gradient problem.

17.9 Summary 

RNNs are neural networks specialized in processing sequential data such as
audio and video. They have proven very successful in practical applications
such as speech recognition, handwriting generation, text generation,
machine translation, image captioning, unconstrained handwriting recognition,
etc.

An RNN is obtained as a finite sequence of parameter-sharing feedforward
neural networks of the same architecture, whose hidden layers exchange
information by a transition function. RNNs train by a variant of the
backpropagation method, called backpropagation through time, which is more
expensive than the regular FNN training.

Plain vanilla RNNs are plagued by two problems: vanishing gradient and
exploding gradient. There are several partial remedies to these problems.
The exploding gradient problem can be mitigated by gradient clipping and
initialization by the identity matrix, while the vanishing gradient problem can be
addressed by employing a new type of RNN architectures that include gated
cells, such as LSTM, GRU, etc.

RNNs containing several horizontal layers are called deep RNNs. They 
are useful in problems regarding more complicated pattern extraction from 
the raw sequence data. 

17.10 Exercises 

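Exercise 17.10.1 (a) Let $\mathcal{G}_1$, $\mathcal{G}_2$, and $\mathcal{G}_3$ be three sigma-algebras. Show that

$$\mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3) = \mathfrak{S}\bigl(\mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2) \cup \mathcal{G}_3\bigr).$$

(b) Formulate and prove a generalization.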







Figure 17.5: A “2-to-1” RNN for Exercise 17.10.3.


Exercise 17.10.2 (a) Consider the matrix 



Show that $\lim_{n \to \infty} W^n = O_2$, where $O_2$ denotes the $2 \times 2$ zero matrix.

(b) Let $A$ be a $k \times k$ symmetric matrix and denote by $\rho(A) = \max_{1 \le i \le k} |\lambda_i|$ its
spectral radius, where $\lambda_i$ denote the eigenvalues of $A$. Consider the matrix
$W = \frac{1}{1 + \rho(A)} A$. Prove that $\lim_{n \to \infty} W^n = O_k$.


Exercise 17.10.3 Consider an RNN with two one-dimensional inputs, $X_1$,
$X_2$, and one output, $Y$, see Fig. 17.5. The loss function is $L = \frac{1}{2}(Y - Z)^2$,
where $Z$ denotes the target. Find the equations of the backpropagation
through time algorithm in this case.

Exercise 17.10.4 Consider the RNN given in Fig. 17.5 and assume the
inputs $X_1, X_2$ are random variables, the initial state $h_0$ is given, and the
target $Z$ is a random variable. Denote the output sigma-algebra by $\mathcal{E} = \mathfrak{S}(Y)$.
We say that $Z$ is learnable if $\mathfrak{S}(Z) \subset \mathcal{E}$.

Which of the following is always true:

(a) If $Z$ is learnable then $\mathfrak{S}(Z) \subset \mathfrak{S}(X_1, X_2)$.

(b) If $\mathfrak{S}(Z) \subset \mathfrak{S}(X_1, X_2)$ then $Z$ is learnable.


Exercise 17.10.5 Consider a one-dimensional dynamical system whose state
updates as

$$h_n = f(h_{n-1}; \theta), \qquad n \ge 1,$$

where the transition function is $f(x; \theta) = \tanh(\theta x)$, with $|\theta| < 1$. Find the
hidden state of the system in the long run, $\lim_{n \to \infty} h_n$.

Exercise 17.10.6 The same question as in Exercise 17.10.5, but replacing
the transition function by a logistic sigmoid, $f(x; \theta) = \sigma(\theta x)$.

Exercise 17.10.7 The same hypothesis as in Exercise 17.10.5, but replacing
the transition function by a sine function, $f(x; \theta) = \sin(\theta x)$. Show that the
long run behavior of the dynamical system depends on the parameter $\theta$ and
initial state $h_0$.






Chapter 18 

Classification 


In classification problems a neural network has to be able to classify clusters,
i.e., to assign a label to each cluster. These labels can be either natural
numbers, points in the space or vectors, belonging to a label space. The
classification procedure is equivalent to being able to learn a “cluster splitting
function” or a decision map. The training set provides a label for each
cluster. This assignment defines a decision map. The network will be able to
classify the testing data by learning this decision map, i.e., to state which
cluster each testing point belongs to.

18.1 Equivalence Relations 

Consider the hypercube, $I_n = [0,1]^n$. Any subset $S \subset I_n \times I_n$ is called a
relation on $I_n$. The following properties will be useful shortly:

(i) $S$ is called reflexive if it contains the hypercube diagonal $\{(x,x);\ x \in I_n\}$.
This states that $(x,x) \in S$ for all $x \in I_n$.

(ii) $S$ is called symmetric if it is symmetric with respect to the diagonal
$\{(x,x);\ x \in I_n\}$. This means that if $(x,y) \in S$, then also $(y,x) \in S$.

(iii) $S$ is called transitive if $(x,y) \in S$ and $(y,z) \in S$ imply $(x,z) \in S$. The
geometric interpretation of this property is that $S$ is closed with respect to rectangles
which have one of the vertices on the diagonal. More precisely, if $(x,y) \in S$,
then there is a unique rectangle with one vertex at this point and with another
vertex at $(y,y)$, see Fig. 18.2 a. Transitivity reduces to the fact that if three
of the rectangle vertices, $(x,y)$, $(y,y)$, $(y,z)$, belong to the set $S$, then also
the fourth vertex, $(x,z)$, belongs to $S$.


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_18 




Figure 18.1: a. Reflexive, nonsymmetric relation. b. Reflexive and symmetric 
relation. c. Symmetric nonreflexive relation. 



Figure 18.2: a. The rectangle rule: if $(x,y), (y,y), (y,z) \in S$, then $(x,z) \in S$.
b. Equivalence relation $S$, which makes a finite partition of $I_n$.


An example of a transitive relation is the lattice of rational numbers in
$[0,1]$, given by $S = (I_n \times I_n) \cap (\mathbb{Q} \times \mathbb{Q})$, as the reader can easily check using the
previous rectangle property.

The previous relation properties are mutually independent, see Fig. 18.1 a,
b, and c.

A relation $S$ which is reflexive, symmetric, and transitive is called an
equivalence relation. See Fig. 18.2 b for an example of an equivalence relation.
The set $S$ in this case is a union of interiors of squares along the diagonal.

Two points $x, y \in I_n$ are called equivalent under the relation $S$ if $(x,y) \in S$.
Customarily, we write $x \sim y$. The set of all points which are equivalent to a given $x$ is
denoted by $C_x = \{y \in I_n;\ x \sim y\}$ and is called the equivalence class of $x$. The
set of all equivalence classes, denoted by $I_n/\!\sim$, is called the quotient set.
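
The three properties and the equivalence classes can be checked mechanically when the underlying set is finite. The sketch below replaces the hypercube $I_n$ by a small finite set, which is an assumption made only to obtain something computable; the helper names and the two sample relations are hypothetical.

```python
from itertools import product

def is_reflexive(S, X):
    return all((x, x) in S for x in X)

def is_symmetric(S):
    return all((y, x) in S for (x, y) in S)

def is_transitive(S):
    return all((x, z) in S for (x, y) in S for (y2, z) in S if y == y2)

def equivalence_class(S, x, X):
    """C_x = {y in X ; x ~ y}."""
    return {y for y in X if (x, y) in S}

X = {0, 1, 2, 3}

# x ~ y if and only if x and y have the same parity: an equivalence relation.
S = {(x, y) for x, y in product(X, X) if x % 2 == y % 2}
classes = {frozenset(equivalence_class(S, x, X)) for x in X}   # quotient set

# The diagonal plus the single pair (0, 1): reflexive but not symmetric.
R = {(x, x) for x in X} | {(0, 1)}

print(is_reflexive(S, X), is_symmetric(S), is_transitive(S), classes)
print(is_reflexive(R, X), is_symmetric(R), is_transitive(R))
```

The second relation shows that one property can hold while another fails, in the spirit of the examples in Fig. 18.1.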





























A partition of the set $I_n$ is a collection of subsets $\{A_i\}_i$ with the following
properties:

(i) $A_i \neq \emptyset$, for all $i$;

(ii) $A_i \cap A_j = \emptyset$, for $i \neq j$;

(iii) $\bigcup_i A_i = I_n$.

If the collection of indices $i$ is finite, then $\{A_i\}_i$ is a finite partition. Finite partitions can
be used to classify the points of $I_n$ into finitely many distinct classes.

The relation between partitions and equivalence relations is given by the
following result:

Proposition 18.1.1 Let $\sim$ be an equivalence relation on $I_n$. Then there is a
partition $\{A_i\}_i$ of $I_n$ such that:

(i) for each $i$, $\forall x, y \in A_i$ we have $x \sim y$;

(ii) $\forall x, y \in I_n$ with $x \sim y$ there is an $i$ such that $x, y \in A_i$.

Proof: The result can be restated by saying that any equivalence relation
makes a partition of $I_n$ and the elements of the partition are the equivalence
classes of the relation. Let $C_x$ be the equivalence class of $x$. We show that
the collection $\{C_x\}_x$ satisfies the properties of a partition. Since $x \in C_x$,
then obviously $C_x \neq \emptyset$ and $\bigcup_{x \in I_n} C_x = I_n$. It is easy to see that $x \sim y$ is
equivalent to $C_x = C_y$. Assume we have an element in the intersection of
two distinct classes, $z \in C_x \cap C_y$. Then $z \in C_x$ and hence $x \sim z$, and $z \in C_y$
and hence $z \sim y$. By transitivity $x \sim y$, which implies $C_x = C_y$, a contradiction. It
follows that any two distinct classes have an empty intersection. Hence, the
set $\{C_x\}_x$ satisfies the properties of a partition. ■

It is worth noting that the converse also holds true: any partition defines
an equivalence relation. If $\{A_i\}_i$ is a partition, then the relation $x \sim y$ if and
only if there is an $i$ such that $x, y \in A_i$ is an equivalence relation on $I_n$.

The way of obtaining the partition $\{A_j\}_j$ from the equivalence relation $S$
using projection can be seen in Fig. 18.2 b. This provides also a visualization
of the equivalence relation $S$ associated with a given partition: the set $S$
consists of the union of rectangles aligned along the diagonal, constructed from the
projections $A_j$. This set $S$ contains the diagonal (is reflexive), is symmetric,
and satisfies the rectangle rule (is transitive).
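
The passage from a partition to its equivalence relation and back can be illustrated on a finite set. In the sketch below the relation is built as the union of the “squares” $A_j \times A_j$, and the partition is then recovered as the set of equivalence classes; the finite blocks and the function names are hypothetical and stand in for the subsets of $I_n$.

```python
def relation_from_partition(blocks):
    """S = union of the squares A_j x A_j along the diagonal."""
    return {(x, y) for block in blocks for x in block for y in block}

def partition_from_relation(S, X):
    """The distinct equivalence classes C_x = {y ; (x, y) in S}."""
    return {frozenset(y for y in X if (x, y) in S) for x in X}

# A hypothetical partition of the finite set {0, ..., 5}.
blocks = [{0, 1}, {2}, {3, 4, 5}]
X = set().union(*blocks)

S = relation_from_partition(blocks)
recovered = partition_from_relation(S, X)
print(sorted(S))
print(recovered)
```

Recovering exactly the original blocks reflects the one-to-one correspondence between partitions and equivalence relations.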

In the next sections we shall associate with a given partition different
objects, such as entropy, decision functions, labels, decision maps, etc.

18.2 Entropy of a Partition 

In this section we shall extend the notion of entropy to a partition. For
this purpose, we shall consider a probability space $(\Omega, \mathcal{F}, \mu)$ and a finite
measurable partition $A = (A_j)_{j \le m}$ of the set $\Omega$, i.e., a partition with $A_j \in \mathcal{F}$.
The measure $\mu$ can be used to assess numerically the sets $A_j$. The entropy of
the partition $A$ with respect to the probability measure $\mu$ is defined by

$$H(A, \mu) = -\sum_{j=1}^{m} \mu(A_j) \ln \mu(A_j). \qquad (18.2.1)$$

Since $\mu(A_j) \in [0,1]$, the entropy is nonnegative, $H(A, \mu) \ge 0$ (with the convention $0 \ln 0 = 0$). It can be shown
that the entropy of the partition $A$ is maximum when all the sets in the
partition have the same measure $\mu(A_1) = \cdots = \mu(A_m) = \frac{1}{m}$.
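
Formula (18.2.1) is straightforward to evaluate once the measures $\mu(A_j)$ are known. The following sketch takes the list of these measures as input and compares a uniform partition with a skewed one; the function name and the two sample lists are hypothetical, and sets of measure zero are skipped following the convention $0 \ln 0 = 0$.

```python
import math

def partition_entropy(measures):
    """H(A, mu) = -sum_j mu(A_j) ln mu(A_j) for a probability measure mu."""
    if abs(sum(measures) - 1.0) > 1e-12:
        raise ValueError("the measures of the partition sets must add up to 1")
    return -sum(p * math.log(p) for p in measures if p > 0)

uniform = partition_entropy([0.25, 0.25, 0.25, 0.25])
skewed = partition_entropy([0.7, 0.1, 0.1, 0.1])

print("uniform partition:", uniform, " ln 4 =", math.log(4))
print("skewed partition: ", skewed)
```

The uniform partition attains the value $\ln m$, here with $m = 4$, while any other choice of measures on four sets gives a smaller entropy.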

Example 18.2.1 We assume that each element $\omega$ of $\Omega$ is associated with a
nonnegative numerical label, such as a weight or a mass, $m(\omega)$. The
probability measure in this case is

$$\mu(A) = \frac{1}{M} \sum_{x \in A} m(x)\,\delta_x(A),$$

where $M = \sum_{x \in \Omega} m(x)$ is the total mass of $\Omega$ and $\delta_x$ denotes the Dirac measure
sitting at $x$. The number $\mu(A)$ provides the proportion of mass corresponding
to the set $A$. The entropy (18.2.1) represents the uncertainty of dividing the
set $\Omega$ in parts of unequal masses.

Example 18.2.2 Let $\Omega \subset \mathbb{R}^n$ be a bounded Borel set. For any Borel set,
$A \in \mathcal{B}(\Omega)$, we define the probability measure $\mu(A) = \frac{\lambda(A)}{\lambda(\Omega)}$, where $\lambda$ denotes
the Lebesgue measure on $\Omega$. In this case, the entropy (18.2.1) represents the
uncertainty of dividing the set $\Omega$ in subsets of unequal volumes.


Example 18.2.3 Let $\mu$ be a measure absolutely continuous with respect to
the measure $\nu$ on the measurable space $(\Omega, \mathcal{F})$. By the Radon-Nikodym
Theorem, see Theorem C.7 in the Appendix, there is a measurable nonnegative
function $p$ such that

$$\mu(A) = \int_A p(x)\,d\nu(x),$$

for any measurable set $A$ in $\Omega$. If $p$ is a density function, i.e., $\int_\Omega p(x)\,d\nu(x) = 1$,
then $\mu$ becomes a probability measure. The entropy associated with the
partition $A$ and measure $\mu$ is

$$H(A, \mu) = -\sum_{j=1}^{m} \Bigl(\int_{A_j} p(x)\,d\nu(x)\Bigr) \ln \Bigl(\int_{A_j} p(x)\,d\nu(x)\Bigr).$$

In the particular case when the measures are proportional, i.e., $\mu = c\nu$ with
$c$ constant, the density function is $p(x) = \frac{1}{\nu(\Omega)}$ and the previous entropy
becomes

$$H(A, \mu) = -\sum_{j=1}^{m} \frac{1}{\nu(\Omega)}\nu(A_j) \ln\Bigl(\frac{1}{\nu(\Omega)}\nu(A_j)\Bigr)$$

$$= -\frac{1}{\nu(\Omega)} \sum_{j=1}^{m} \nu(A_j)\bigl(-\ln \nu(\Omega) + \ln \nu(A_j)\bigr)$$

$$= \ln \nu(\Omega) + \frac{1}{\nu(\Omega)} H(A, \nu),$$

which is a relation between the entropies of the partition $A$ with respect to
two proportional measures.


18.3 Decision Functions 

Let $\{A_1, \dots, A_k\}$ be a finite measurable partition of $I_n$, i.e., a partition with
$A_i$ Borel sets, $A_i \in \mathcal{B}(I_n)$, for all $i = 1, \dots, k$. A decision function is a
measurable function which associates an integer with each class in the partition,
i.e., $f : I_n \to \mathbb{N}$, $f(x) = j$ for any $x \in A_j$. We can regard $j$ as being the
label associated with the class $A_j$. Equivalently, $f = \sum_{j=1}^{k} j\,1_{A_j}$, where $1_{A_j}$ is
the indicator function of the set $A_j$. Decision functions are used to classify
data into classes. Note that $A_j = f^{-1}(j)$, see Fig. 18.3 b.

The set $\{1, 2, \dots, k\}$ is called the label set and the space that contains
the label set, in this case $\mathbb{R}$, is the label space. The labels are considered
consecutive integers just for convenience.

Example 18.3.1 (case $k = 2$) Consider two separable clusters of points in
$\mathbb{R}^n$ separated by the hyperplane $\{w^T x + \theta = 0\}$. Then a classical perceptron
can decide which of the two clusters a point belongs to, using the decision
function $f(x) = 1 + H(w^T x + \theta)$, where $H$ denotes the Heaviside step function. The label set is $\{1, 2\}$ and the label space
is $\mathbb{R}$.
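
The perceptron decision function of this example can be written directly in code. The sketch below assumes the convention $H(0) = 1$ for the Heaviside function, which the text does not fix, and uses a hypothetical hyperplane $x_1 + x_2 = 1$ in $\mathbb{R}^2$ together with two sample points.

```python
import numpy as np

def heaviside(t):
    """Heaviside step function; the value at t = 0 is a convention (here 1)."""
    return 1 if t >= 0 else 0

def decision(x, w, theta):
    """Decision function f(x) = 1 + H(w^T x + theta), with values in {1, 2}."""
    return 1 + heaviside(np.dot(w, x) + theta)

# Hypothetical separating hyperplane x1 + x2 = 1.
w, theta = np.array([1.0, 1.0]), -1.0
sample_points = [(0.1, 0.2), (0.9, 0.8)]
labels = [decision(np.array(p), w, theta) for p in sample_points]
print(labels)
```

Points on the side of the hyperplane where $w^T x + \theta < 0$ receive the label 1, and the remaining points receive the label 2.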

Example 18.3.2 Consider a certain attribute of the points in the hypercube
$I_n$, such as, for instance, color. Assume there are $k$ possible colors the points
can get. Then an equivalence relation can be defined on $I_n$: two points are
equivalent if and only if they have the same color. Let $A_j$ be the set of points
of the $j$th color. Then the sets $\{A_j\}$ form a partition of $I_n$ and a mapping $f$ from
$I_n$ to the label space $\{1, \dots, k\}$ defined by $f(A_j) = j$ is a classification rule.
Sometimes the function $f$ is called a classifier. We note that in this example
the sets $A_j$ are not necessarily Borel-measurable.

The next result deals with the implementation of a decision function,
see Cybenko [30]. Since in real life clusters are not perfectly separable, the
result will take this into consideration. It states that a neural network with
a single internal layer can implement any decision function in such a way that the
total Lebesgue measure, $\lambda$, of the incorrectly classified points can be made
arbitrarily small, see Fig. 18.3 a.

Figure 18.3: a. Subset $D$ in $I_2$ with the Lebesgue measure of the complement
small. b. Decision function associated with a partition.


Proposition 18.3.3 Let $f$ be a decision function associated with the measurable
finite partition $\{A_i\}_i$ on $I_n$ and let $\sigma$ be a continuous sigmoidal function.
For any $\epsilon > 0$, there is a finite sum $G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + \theta_j)$, $w_j \in \mathbb{R}^n$,
$\alpha_j, \theta_j \in \mathbb{R}$, and a set $D \subset I_n$ such that $\lambda(D) \ge 1 - \epsilon$ and

$$|G(x) - f(x)| < \epsilon, \qquad \forall x \in D.$$

Proof: By Luzin’s theorem (see Appendix, section C.8), there is a continuous
function $g : I_n \to \mathbb{R}$ and a set $D$ such that $\lambda(D) \ge 1 - \epsilon$ and $g(x) = f(x)$
for all $x \in D$. By Theorem 9.3.8 (or Theorem 9.3.6) the sums of the previous
form $G(x)$ are dense in $C(I_n)$, so for the previous $g \in C(I_n)$ we find a $G(x)$
such that $|G(x) - g(x)| < \epsilon$ for all $x \in I_n$. Therefore

$$|G(x) - f(x)| = |G(x) - g(x)| < \epsilon, \qquad \forall x \in D.$$

■

Note that this is an existence result; the actual construction of the
function $G(x)$ (i.e., finding the weights $w_j$, $\alpha_j$ and thresholds $\theta_j$) is a completely
different problem.


















Remark 18.3.4 To each decision function we can associate an entropy as 
in the following. Given a finite partition, A = (Ai)i, and a decision function 
/, we define the measure /i such that fi(Ai) — \-) an( ^ cons ider the 

entropy $H(A, \mu)$ to be the entropy associated with the partition $A$ and decision
function $f$.


18.4 One-hot-vector Decision Maps 

Sometimes, it is more convenient to replace the integer labels by one-hot
vectors. For instance, instead of labels 1, 2, etc., we can associate the one-hot
vectors $e_1 = (1, 0, \dots, 0)^T$, $e_2 = (0, 1, 0, \dots, 0)^T$, etc. The label set is formed
by $\{e_1, \dots, e_k\}$, while the label space is $\mathbb{R}^k$. Thus, we arrive at the following
definition.

Let $\{A_1, \dots, A_k\}$ be a finite measurable partition of $I_n$, i.e., a partition
with $A_i$ Borel sets, $A_i \in \mathcal{B}(I_n)$, for all $i = 1, \dots, k$. A one-hot-vector decision
map is a measurable function $f : I_n \to \mathbb{R}^k$, which associates a one-hot
vector with each class in the partition, i.e., $f(x) = e_j$ for any $x \in A_j$,
$e_j = (0, \dots, 1, \dots, 0)^T$. In this case, the label associated with the class $A_j$ is
a $k$-dimensional unit vector, and all these label vectors form a basis in $\mathbb{R}^k$.
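
A one-hot-vector decision map can be illustrated in the simplest case $n = 1$. The sketch below assumes a hypothetical partition of $I_1 = [0,1]$ into $k$ consecutive intervals of equal length, $A_j = [(j-1)/k, j/k)$, with the last interval closed; neither this partition nor the function names come from the text.

```python
import numpy as np

def one_hot(j, k):
    """The one-hot vector e_j in R^k, for j = 1, ..., k."""
    e = np.zeros(k)
    e[j - 1] = 1.0
    return e

def decision_map(x, k):
    """One-hot decision map f(x) = e_j for x in A_j = [(j-1)/k, j/k)."""
    j = min(int(x * k) + 1, k)     # the point x = 1 is assigned to the last class
    return one_hot(j, k)

k = 4
for x in (0.1, 0.5, 1.0):
    print(x, decision_map(x, k))

# The k label vectors are linearly independent, hence a basis of R^k.
label_matrix = np.stack([one_hot(j, k) for j in range(1, k + 1)])
print(np.linalg.matrix_rank(label_matrix))
```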

What is the advantage of using one-hot vectors as labels? In the case when
the labels are just integers, the set $I_n$ is mapped into the real line and this
provides some localization for the testing sets around some given integers.

In the case of one-hot vectors as labels, the set $I_n$ is mapped into a higher
dimensional space, $\mathbb{R}^k$. This provides more room for the testing sets to be
pooled toward linearly independent directions, leading to a better separation
of classes.

Choosing the one-hot vectors $e_j$ as labels is just a convenience. We may
choose as labels any other $k$ vectors that are linearly independent in $\mathbb{R}^k$,
possibly organized as an orthonormal basis. Equivalently, instead of
considering $k$ independent vectors, we may consider the labels to be $k$ points in
$\mathbb{R}^k$, denoted by $P_1, P_2, \dots, P_k$, whose position vectors are linearly
independent.

The label space can have a smaller dimension than $k$, which is the
number of classes in $I_n$. For instance, in Fig. 18.4 we consider a partition of $I_n$
into $k = 4$ classes, and associate to each class a label point in $\mathbb{R}$ and $\mathbb{R}^2$,
respectively.

The next two linear algebra results state the relation between using labels
as either points or one-hot vectors.

Proposition 18.4.1 Consider $k$ distinct points, $P_1, \dots, P_k$, in $\mathbb{R}^k$. Then there
is a linear function $f : \mathbb{R}^k \to \mathbb{R}^k$ such that $f(e_j) = P_j$, for $j = 1, \dots, k$.



Figure 18.4: Decision map associated with a partition: a. The label space is
$\mathbb{R}$. b. The label space is $\mathbb{R}^2$.

Figure 18.5: Points in a general position: a. Two distinct points on a line.
b. Three noncollinear points in the plane. c. Four noncoplanar points in the
3-dimensional space.


Proof: Let $v_j = (v_j^1, \dots, v_j^k)^T$ be the coordinate vector of the point $P_j$ in $\mathbb{R}^k$,
so we can write $v_j = \sum_{i=1}^{k} v_j^i e_i$. Then the linear function $f(x) = Wx$, with
the matrix $W_{ij} = v_j^i$, is the desired function. ■
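
The construction in the proof amounts to placing the coordinate vectors of the points as the columns of $W$. The following sketch verifies this for three hypothetical points in $\mathbb{R}^3$; the coordinates are arbitrary illustrative values.

```python
import numpy as np

# Rows are the coordinate vectors of hypothetical points P_1, P_2, P_3 in R^3.
points = np.array([[2.0, 1.0, 0.0],
                   [0.0, 3.0, 1.0],
                   [1.0, 1.0, 1.0]])

W = points.T                      # column j of W is v_j, i.e. W_ij = v_j^i
k = W.shape[0]
basis = np.eye(k)                 # columns are the one-hot vectors e_1, ..., e_k
images = [W @ basis[:, j] for j in range(k)]

for j, image in enumerate(images, start=1):
    print(f"f(e_{j}) =", image)
```

Each one-hot vector simply selects the corresponding column of $W$, so $f(e_j) = v_j$, the coordinate vector of $P_j$.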

The converse is not necessarily true. To make this work, we need to impose
an extra condition, which will be introduced next.

First, we note that through 2 distinct points passes a unique line, while
through 3 noncollinear points passes a unique plane. Through 4 noncoplanar
points passes a unique 3-dimensional hyperplane, and so on, see Fig. 18.5.
If $k$ points are situated in a general enough position, the dimension of the
hyperplane determined by them is $k - 1$; otherwise the dimension of the
hyperplane is strictly smaller than $k - 1$.

Definition 18.4.2 The points $P_1, \dots, P_k$ in $\mathbb{R}^k$ are said to be in a general
position if there is no hyperplane of dimension less than $k - 1$ containing
them.

































Figure 18.6: The points $P_1, \dots, P_k$ are in a general position if the vectors
$\overrightarrow{P_0P_j}$ are linearly independent.


Equivalently stated, the lowest dimension of a hyperplane containing the
given points is $k - 1$. Another equivalent statement is given in Exercise
18.11.2. The next result is a useful characterization of this concept using
vectors. It roughly states that, after changing the origin of the space, the position
vectors of the points become linearly independent.

Proposition 18.4.3 The points $P_1, \dots, P_k$ in $\mathbb{R}^k$ are in a general position if
and only if there is a point $P_0$ in $\mathbb{R}^k$ such that the vectors $\overrightarrow{P_0P_j}$, $j \in \{1, \dots, k\}$, are linearly
independent.

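Proof: “$\Longrightarrow$” Assume the points $P_1, \dots, P_k$ are in a general position. Then
by Exercise 18.11.3, there is a unique $(k-1)$-hyperplane, $\mathcal{H}$, which contains
the points. Then choosing any point $P_0 \notin \mathcal{H}$ yields the linearly independent
vectors $\overrightarrow{P_0P_1}, \dots, \overrightarrow{P_0P_k}$, see Fig. 18.6. To show this, we form a vanishing linear
combination

$$\sum_{i=1}^{k} c_i \overrightarrow{P_0P_i} = 0$$

and show that $c_i = 0$. Using the vector decomposition $\overrightarrow{P_0P_i} = \overrightarrow{P_0P_1} + \overrightarrow{P_1P_i}$
we can write

$$0 = \sum_{i=1}^{k} c_i \overrightarrow{P_0P_i} = \Big(\sum_{i=1}^{k} c_i\Big) \overrightarrow{P_0P_1} + \sum_{i=2}^{k} c_i \overrightarrow{P_1P_i}.$$

The set $\{\overrightarrow{P_1P_2}, \dots, \overrightarrow{P_1P_k}\}$ forms a system of independent vectors in $\mathcal{H}$, see
Exercise 18.11.3. Since $P_0 \notin \mathcal{H}$, the vector $\overrightarrow{P_1P_0} = -\overrightarrow{P_0P_1}$ is independent of the previous
system, since it points outside the hyperplane $\mathcal{H}$. Therefore, the previous
linear combination has zero coefficients: $\sum_{i=1}^{k} c_i = 0$ and $c_i = 0$ for $i \ge 2$,
which also forces $c_1 = 0$. Hence $c_j = 0$ for all $j$.

“$\Longleftarrow$” Let $P_0 \in \mathbb{R}^k$ be such that $\{\overrightarrow{P_0P_1}, \dots, \overrightarrow{P_0P_k}\}$ are linearly independent.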

























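If the points $\{P_1, \dots, P_k\}$ are not in a general position, then they must be
contained inside of a hyperplane $\mathcal{H}$ of dimension $p$, with $p < k-1$. We have

$$\mathcal{H} = \Big\{ Q \in \mathbb{R}^k;\ \overrightarrow{P_0Q} = \sum_{j=1}^{k} \lambda_j \overrightarrow{P_0P_j},\ \sum_{j=1}^{k} \lambda_j = 1 \Big\}.$$

Since $\{\overrightarrow{P_0P_1}, \dots, \overrightarrow{P_0P_k}\}$ are linearly independent, the hyperplane $\mathcal{H}$ has
dimension $k-1$, which leads to a contradiction. ■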

The next result is the converse of Proposition 18.4.1.

Proposition 18.4.4 Consider $k$ distinct points, $P_1, \dots, P_k \in \mathbb{R}^k$, in a general
position. Then there is a linear function $f : \mathbb{R}^k \to \mathbb{R}^k$ such that $f(P_j) = e_j$,
for $j = 1, \dots, k$. The function $f$ is invertible.

Proof: By Proposition 18.4.3, we can choose a point $P_0$ such that $\overrightarrow{P_0P_j}$
are linearly independent. These vectors actually form a basis in $\mathbb{R}^k$. Let $g :
\mathbb{R}^k \to \mathbb{R}^k$ be the unique linear function such that $g(\overrightarrow{P_0P_j}) = e_j$, $j = 1, \dots, k$.
Denote by $r$ the function that assigns to each point $P$ in $\mathbb{R}^k$ the vector $\overrightarrow{P_0P}$,
i.e., $r(P) = \overrightarrow{P_0P}$. Construct the function $f$ by the composition $f = g \circ r$. Then
$f$ is linear, as a composition of linear functions, and satisfies the property

$$f(P_j) = e_j.$$

■

Proposition 18.4.4 assures the equivalence between choosing labels either
as one-hot vectors, $e_j$, or as points in general position, $P_j$, in $\mathbb{R}^k$. We shall deal
with both cases in future sections.
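The construction in the proof of Proposition 18.4.4 can be sketched numerically for $k = 2$. The snippet below is only an illustration under stated assumptions: the function name `one_hot_map` and the sample points are hypothetical, and the map is computed as $g(P - P_0)$ by solving a $2 \times 2$ system with Cramer's rule.

```python
def one_hot_map(p1, p2, p0):
    """Return f with f(p1) = (1, 0) and f(p2) = (0, 1), built as g(P - p0).

    Hypothetical helper: p0 must be chosen so that P0P1 and P0P2 are
    linearly independent (as in Proposition 18.4.3).
    """
    # Columns of the 2x2 matrix are the vectors P0P1 = (a, c) and P0P2 = (b, d).
    a, c = p1[0] - p0[0], p1[1] - p0[1]
    b, d = p2[0] - p0[0], p2[1] - p0[1]
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("P0P1 and P0P2 are linearly dependent")

    def f(p):
        u, v = p[0] - p0[0], p[1] - p0[1]
        # Coordinates (s, t) of P0P in the basis {P0P1, P0P2} (Cramer's rule).
        return ((d * u - b * v) / det, (a * v - c * u) / det)

    return f


f = one_hot_map((1.0, 2.0), (3.0, 1.0), (0.0, 0.0))
print(f((1.0, 2.0)), f((3.0, 1.0)))
```

The two printed pairs are the images of the two label points; up to rounding they are the one-hot vectors $e_1$ and $e_2$.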

18.5 Linear Separability 

A cluster $\mathcal{G}$ of points in $\mathbb{R}^n$ is a set of $n$-tuples $(x_1, \dots, x_n)$ supposed to have a
certain individual identity. Two clusters in $\mathbb{R}^n$, $\mathcal{G}_1$ and $\mathcal{G}_2$, are called linearly
separable if there is a hyperplane $\mathcal{H}$ of dimension $n-1$, which separates the
clusters. This means:

(i) The hyperplane $\mathcal{H}$ divides the space $\mathbb{R}^n$ in two half-spaces, $\mathcal{S}_1$ and $\mathcal{S}_2$.
(ii) Each cluster is contained in one of the half-spaces: $\mathcal{G}_1 \subset \mathcal{S}_1$ and $\mathcal{G}_2 \subset \mathcal{S}_2$.
If the hyperplane $\mathcal{H}$ is defined by the equation

$$h(x) = a_1 x_1 + \cdots + a_n x_n + d = 0,$$

then the separability of $\mathcal{G}_1$ and $\mathcal{G}_2$ can be written as $h(g_1)h(g_2) < 0$, for any
points $g_1 \in \mathcal{G}_1$, $g_2 \in \mathcal{G}_2$. This means that $h$ keeps constant opposite signs on
each of the clusters, see Fig. 18.7 a.
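The sign condition $h(g_1)h(g_2) < 0$ translates directly into a check that a given hyperplane separates two finite clusters. The following is a minimal sketch; the names `h` and `separates` and the sample data are illustrative assumptions.

```python
def h(a, d, x):
    """Evaluate h(x) = a_1 x_1 + ... + a_n x_n + d."""
    return sum(ai * xi for ai, xi in zip(a, x)) + d


def separates(a, d, cluster1, cluster2):
    """True if h(g1) * h(g2) < 0 for every pair of points from the two clusters."""
    return all(h(a, d, g1) * h(a, d, g2) < 0
               for g1 in cluster1 for g2 in cluster2)


# Hypothetical clusters in the plane.
G1 = [(0.0, 0.0), (1.0, 0.0)]
G2 = [(3.0, 1.0), (4.0, 0.0)]
print(separates((1.0, 0.0), -2.0, G1, G2))   # vertical line x = 2
print(separates((0.0, 1.0), -0.5, G1, G2))   # horizontal line y = 0.5
```

For this data the vertical line separates the clusters, while the horizontal one does not, since the point $(4, 0)$ lies on the same side as the first cluster.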













Figure 18.7: a. Linear separability of two clusters. b. Linear separability of 
the convex hulls of two clusters. 


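Example 18.5.1 Two clusters in $\mathbb{R}$, $\mathcal{G}_1$ and $\mathcal{G}_2$, are separable if there is a
number $a$ such that $(g_1 - a)(g_2 - a) < 0$, for all $g_1 \in \mathcal{G}_1$, $g_2 \in \mathcal{G}_2$. This means
that either $g_1 < a < g_2$ or $g_2 < a < g_1$, for all $g_1 \in \mathcal{G}_1$, $g_2 \in \mathcal{G}_2$.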

A set $K \subset \mathbb{R}^n$ is called convex if for any two points $A, B \in K$, the line
segment $AB$ is included in the set $K$. For instance, a disk, the interior of a
triangle or a tetrahedron, are convex sets.

In general, a cluster is not a convex set. The convex hull of a cluster $\mathcal{G}$ is
the set of all convex combinations

$$\mathrm{hull}(\mathcal{G}) = \Big\{ \sum_{g_i \in \mathcal{G}} \lambda_i g_i;\ \sum_{i=1}^{n} \lambda_i = 1,\ \lambda_i \ge 0 \Big\}.$$

For instance, if a cluster has only 2 points, its hull is the closed line segment
defined by the points. If the cluster contains 3 points, its hull is the triangle
(including the interior) with vertices at the given points. It can be shown
that the set $\mathrm{hull}(\mathcal{G})$ is always a convex set, which contains the cluster $\mathcal{G}$, see
Exercise 18.11.5.

Proposition 18.5.2 Two clusters $\mathcal{G}_1$ and $\mathcal{G}_2$ are linearly separable if and
only if $\mathrm{hull}(\mathcal{G}_1)$ and $\mathrm{hull}(\mathcal{G}_2)$ are linearly separable.

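Proof: “$\Longrightarrow$” If $\mathcal{G}_1$ and $\mathcal{G}_2$ are linearly separable, then let $\mathcal{H}$ be the
hyperplane that divides the space into two half-spaces, $\mathcal{S}_1$ and $\mathcal{S}_2$, which separate
the clusters, i.e.,

$$\mathcal{G}_1 \subset \mathcal{S}_1, \qquad \mathcal{G}_2 \subset \mathcal{S}_2.$$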








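Since the half-spaces $\mathcal{S}_1$ and $\mathcal{S}_2$ are convex sets, using the convex minimality
of the hull, see Exercise 18.11.5, we have

$$\mathcal{G}_1 \subset \mathrm{hull}(\mathcal{G}_1) \subset \mathcal{S}_1, \qquad \mathcal{G}_2 \subset \mathrm{hull}(\mathcal{G}_2) \subset \mathcal{S}_2.$$

Hence, the convex hulls $\mathrm{hull}(\mathcal{G}_1)$ and $\mathrm{hull}(\mathcal{G}_2)$ are separated by the
hyperplane $\mathcal{H}$.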

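A variant of the proof, starting directly from the definition, is given in the
following:

Assume $\mathcal{G}_1$ and $\mathcal{G}_2$ are linearly separable and let $\mathcal{H}$ be the separation
hyperplane with the equation

$$h(x) = a_1 x_1 + \cdots + a_n x_n + d = 0.$$

For any two points in the clusters' hulls

$$g^1 = \sum_{g_i^1 \in \mathcal{G}_1} \lambda_i^1 g_i^1 \in \mathrm{hull}(\mathcal{G}_1), \qquad g^2 = \sum_{g_j^2 \in \mathcal{G}_2} \lambda_j^2 g_j^2 \in \mathrm{hull}(\mathcal{G}_2),$$

using the linearity of the function $h$ (more precisely, $h$ is affine and the
coefficients of each convex combination sum to 1), yields

$$\begin{aligned}
h(g^1)h(g^2) &= h\Big(\sum_{g_i^1 \in \mathcal{G}_1} \lambda_i^1 g_i^1\Big)\, h\Big(\sum_{g_j^2 \in \mathcal{G}_2} \lambda_j^2 g_j^2\Big) \\
&= \Big(\sum_{g_i^1 \in \mathcal{G}_1} \lambda_i^1 h(g_i^1)\Big) \Big(\sum_{g_j^2 \in \mathcal{G}_2} \lambda_j^2 h(g_j^2)\Big) \\
&= \sum_{g_i^1 \in \mathcal{G}_1} \sum_{g_j^2 \in \mathcal{G}_2} \lambda_i^1 \lambda_j^2\, h(g_i^1) h(g_j^2) < 0,
\end{aligned}$$

because $\lambda_i^1 \ge 0$, $\lambda_j^2 \ge 0$ and we used the clusters separability condition

$$h(g_i^1)h(g_j^2) < 0.$$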


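“$\Longleftarrow$” If $\mathrm{hull}(\mathcal{G}_1)$ and $\mathrm{hull}(\mathcal{G}_2)$ are linearly separable, there is a
hyperplane $\mathcal{H}$ that divides the space $\mathbb{R}^n$ in two half-spaces, $\mathcal{S}_1$ and $\mathcal{S}_2$, such that
$\mathrm{hull}(\mathcal{G}_1) \subset \mathcal{S}_1$ and $\mathrm{hull}(\mathcal{G}_2) \subset \mathcal{S}_2$. Using the obvious inclusions $\mathcal{G}_1 \subset \mathrm{hull}(\mathcal{G}_1)$
and $\mathcal{G}_2 \subset \mathrm{hull}(\mathcal{G}_2)$ it follows that $\mathcal{G}_1 \subset \mathcal{S}_1$ and $\mathcal{G}_2 \subset \mathcal{S}_2$, and hence $\mathcal{G}_1$ and $\mathcal{G}_2$
are linearly separable. See also Fig. 18.7 b. ■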


Even if two clusters have distinct points, sometimes they are close enough
to each other that their convex hulls intersect. In this case the
separability can possibly be achieved only by a nonlinear function.







Figure 18.8: The separation function $F$ of two intersecting clusters is nonlinear.


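Proposition 18.5.3 Let $\mathcal{G}_1$ and $\mathcal{G}_2$ be two clusters in $\mathbb{R}^n$ such that

$$\mathrm{hull}(\mathcal{G}_1) \cap \mathrm{hull}(\mathcal{G}_2) \neq \emptyset.$$

Then there is no linear function $F : \mathbb{R}^n \to \mathbb{R}^p$ such that $F(\mathcal{G}_1)$ and $F(\mathcal{G}_2)$
are linearly separable.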


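Proof: By contradiction, assume there is a linear function $F : \mathbb{R}^n \to \mathbb{R}^p$ such
that $F(\mathcal{G}_1)$ and $F(\mathcal{G}_2)$ are linearly separable, i.e., there is a hyperplane in $\mathbb{R}^p$
defined by the equation $h(x) = \sum_{i=1}^{p} a_i x_i + d = 0$, such that

$$\Phi(g_i^1)\Phi(g_j^2) < 0$$

for all $g_i^1 \in \mathcal{G}_1$ and $g_j^2 \in \mathcal{G}_2$, where $\Phi = h \circ F$.
Consider an element in the intersection

$$g \in \mathrm{hull}(\mathcal{G}_1) \cap \mathrm{hull}(\mathcal{G}_2),$$

which therefore has two representations

$$g = \sum_i \lambda_i^1 g_i^1 = \sum_j \lambda_j^2 g_j^2, \qquad g_i^1 \in \mathcal{G}_1,\ g_j^2 \in \mathcal{G}_2.$$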








Figure 18.9: The convex separation of three clusters.


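Using the linearity of $F$ we obtain the following contradiction:

$$\begin{aligned}
0 \le \Phi(g)\Phi(g) &= \Phi\Big(\sum_i \lambda_i^1 g_i^1\Big)\, \Phi\Big(\sum_j \lambda_j^2 g_j^2\Big) \\
&= \Big(\sum_i \lambda_i^1 \Phi(g_i^1)\Big) \Big(\sum_j \lambda_j^2 \Phi(g_j^2)\Big) \\
&= \sum_{i,j} \lambda_i^1 \lambda_j^2\, \Phi(g_i^1)\Phi(g_j^2) < 0.
\end{aligned}$$

Therefore, there is no separation function $F$ that is linear, see Fig. 18.8.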
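As a small illustration of this argument, consider an XOR-type configuration in $\mathbb{R}^2$. The four points below are a hypothetical example chosen here, not one taken from the text. The two hulls share the point $(1/2, 1/2)$, which has a convex representation in each cluster:

```latex
\mathcal{G}_1 = \{(0,0),\,(1,1)\}, \qquad \mathcal{G}_2 = \{(0,1),\,(1,0)\},
\qquad
g = \tfrac12 (0,0) + \tfrac12 (1,1) = \tfrac12 (0,1) + \tfrac12 (1,0) = \left(\tfrac12, \tfrac12\right).
% If \Phi = h \circ F were positive on \mathcal{G}_1 and negative on \mathcal{G}_2, then
\Phi(g) = \tfrac12 \Phi(0,0) + \tfrac12 \Phi(1,1) > 0,
\qquad
\Phi(g) = \tfrac12 \Phi(0,1) + \tfrac12 \Phi(1,0) < 0 .
```

The two inequalities cannot hold at once, so no linear $F$ followed by a separating hyperplane can split these two clusters.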


Remark 18.5.4 (i) It is worth noting that there is also no separation function
which is affine, i.e., of the form $F(x) = Wx + b$, with $W$ an $n \times n$
matrix and $b \in \mathbb{R}^n$. This follows from the fact that separability is translation
invariant. Consequently, a linear neuron cannot separate two clusters whose
convex hulls intersect. For this job we should employ neural networks with
nonlinear activation functions.

(ii) Two clusters in $\mathbb{R}^n$, $\mathcal{G}_1$ and $\mathcal{G}_2$, are called $F$-separable if there is an invertible
bi-continuous mapping $F : \mathbb{R}^n \to \mathbb{R}^n$ such that the cluster images, $F(\mathcal{G}_1)$,
$F(\mathcal{G}_2)$, are linearly separable in $\mathbb{R}^n$. Such a function $F$ is called a
homeomorphism of $\mathbb{R}^n$. Standard results of neural network theory show that a
feedforward neural network (with enough hidden layers) can learn the
nonlinear continuous function $F$. Augmenting the network with a perceptron we
can perform the final linear classification. Hence, a classification problem is
reduced to the learning of the continuous nonlinear function $F$.

(iii) The role of the function $F$ is to pull apart the clusters, so that they
can be linearly classified. However, there are cases when the clusters cannot
be separated by a homeomorphism of $\mathbb{R}^n$. In this case we need an extra
dimension to separate the clusters and the continuous function should be







$F : \mathbb{R}^n \to \mathbb{R}^p$, with $p > n$. For instance, $\mathcal{G}_1 = \{x \in \mathbb{R}^2;\ \|x\| < 1\}$ and
$\mathcal{G}_2 = \{x \in \mathbb{R}^2;\ \|x\| > 2\}$ cannot be separated in $\mathbb{R}^2$ by pulling them apart
continuously, but they can be separated in $\mathbb{R}^3$ if we pull one of them vertically.
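One way to realize the extra dimension for the disk and its surrounding region is sketched below. The lifting $(x, y) \mapsto (x, y, x^2 + y^2)$ and the plane height $2.5$ are assumptions made for this illustration; the text only states that one of the clusters is pulled vertically.

```python
def lift(p):
    """Assumed lifting R^2 -> R^3: keep (x, y) and add the squared norm as height."""
    x, y = p
    return (x, y, x * x + y * y)


def side(p3, height=2.5):
    """Signed position relative to the horizontal plane z = height.

    Points with ||x|| < 1 have z < 1, points with ||x|| > 2 have z > 4,
    so any height strictly between 1 and 4 gives a separating plane.
    """
    return p3[2] - height


inner = [(0.5, 0.5), (-0.9, 0.0), (0.0, 0.0)]      # sample points with norm < 1
outer = [(3.0, 0.0), (-2.0, 1.0), (0.0, -2.5)]     # sample points with norm > 2
print([side(lift(p)) < 0 for p in inner])
print([side(lift(p)) > 0 for p in outer])
```

After lifting, every sampled inner point lies below the plane and every sampled outer point lies above it, so the lifted clusters are linearly separable in $\mathbb{R}^3$.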

The case of $k$ clusters Consider a family of $k$ clusters, $\mathfrak{G} = \{\mathcal{G}_1, \dots, \mathcal{G}_k\}$. We
say that $\mathfrak{G}$ is a linearly separable family if its clusters are mutually separable,
i.e., any two clusters, $\mathcal{G}_i$ and $\mathcal{G}_j$, are linearly separable, for $i \neq j$.

Remark 18.5.5 (i) By Proposition 18.5.2, the family $\mathfrak{G} = \{\mathcal{G}_1, \dots, \mathcal{G}_k\}$ is
linearly separable if and only if $\{\mathrm{hull}(\mathcal{G}_1), \dots, \mathrm{hull}(\mathcal{G}_k)\}$ is a linearly separable
family of convex sets.

(ii) Denote by $G_j$ the center of mass of the cluster $\mathcal{G}_j$. This means that $G_j$ is
obtained as an average of the elements of the cluster $\mathcal{G}_j$. Applying Exercise
18.11.7, we obtain that $G_j \in \mathrm{hull}(\mathcal{G}_j)$, $j = 1, \dots, k$.

18.6 Convex Separability 

Separability can also be considered in a slightly different, but equivalent way.
A family of clusters $\mathfrak{G} = \{\mathcal{G}_1, \dots, \mathcal{G}_k\}$ in $\mathbb{R}^n$ is called convex separable if there
are $k$ closed balls, $\mathcal{B}_1, \dots, \mathcal{B}_k$, in $\mathbb{R}^n$ such that

(i) $\mathcal{B}_1, \dots, \mathcal{B}_k$ are disjoint;

(ii) $\mathcal{G}_j \subset \mathcal{B}_j$, for all $j \in \{1, \dots, k\}$.

In particular, two clusters in $\mathbb{R}^2$ are convex separable if they are included
in two disjoint disks. For the case of three clusters, see Fig. 18.9. The following
result shows the equivalence between these two types of separability.

Proposition 18.6.1 Two clusters $\mathcal{G}_1$, $\mathcal{G}_2$ in $\mathbb{R}^n$ are convex separable if and
only if they are linearly separable.

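Proof: “$\Longrightarrow$” If $\mathcal{G}_1$ and $\mathcal{G}_2$ are convex separable in $\mathbb{R}^n$, then there are two
disjoint balls, $\mathcal{B}_1$, $\mathcal{B}_2$ in $\mathbb{R}^n$ such that $\mathcal{G}_1 \subset \mathcal{B}_1$ and $\mathcal{G}_2 \subset \mathcal{B}_2$. Let $\mathcal{H}$ be
an $(n-1)$-hyperplane which separates the disjoint balls $\mathcal{B}_1$ and $\mathcal{B}_2$ (this
hyperplane can be constructed perpendicular to the segment joining the centers, passing
through a point which is exterior to both balls). Then $\mathcal{H}$ separates the clusters
$\mathcal{G}_1$ and $\mathcal{G}_2$.

“$\Longleftarrow$” For the sake of simplicity, we shall consider $n = 2$; similar reasoning can
be carried out for higher dimensions. Assume there is a line $\ell$ in the plane,
which separates $\mathcal{G}_1$ and $\mathcal{G}_2$ and divides the plane into two half-planes, $\mathcal{S}_1$ and
$\mathcal{S}_2$, with $\mathcal{G}_1 \subset \mathcal{S}_1$ and $\mathcal{G}_2 \subset \mathcal{S}_2$. Let $M$ be the closest point on the line $\ell$ to the
cluster $\mathcal{G}_1$. There is a small enough $\epsilon > 0$ such that the half-lines starting at $M$
and making angles equal to $\epsilon$ with the line $\ell$ do not intersect $\mathrm{hull}(\mathcal{G}_1)$, see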









Figure 18.10: Construction of a disk that contains the convex hull of a cluster 
and is contained in a given half-plane. 


Fig. 18.10. Let $\beta$ be the angle bisector of the angle made by the half-lines.
Then for any point $O \in \beta$, there is a unique circle centered at $O$, which is
tangent to both half-lines. When the distance $\|OM\|$ is large enough, the
cluster $\mathcal{G}_1$ lies inside the ball centered at $O$. This is the construction of the
ball $\mathcal{B}_1$. Similarly, we can construct the other ball $\mathcal{B}_2$, on the other side of
the line $\ell$, containing the cluster $\mathcal{G}_2$.


Remark 18.6.2 Even if convex separability looks to be a more general concept
and seems to make more sense, it is the linear separability
which can be tackled using neural networks.

The next result deals with the existence of a one-hot-vector decision map for
a family of clusters.

Proposition 18.6.3 Let $\mathfrak{G} = \{\mathcal{G}_1, \dots, \mathcal{G}_k\}$ be a family of $k$ clusters in $\mathbb{R}^k$,
which are linearly separable. Then there is a decision map $F : \mathbb{R}^k \to \mathbb{R}^k$ with
$F(\mathcal{G}_j) = e_j$, $j = 1, \dots, k$.

Proof: Since $\mathfrak{G} = \{\mathcal{G}_1, \dots, \mathcal{G}_k\}$ is linearly separable, then it is convex
separable. Therefore, there are $k$ mutually disjoint balls, $\mathcal{B}_1, \dots, \mathcal{B}_k$, in $\mathbb{R}^k$ such
that $\mathcal{G}_j \subset \mathcal{B}_j$, for all $j$. Consider a decision map $f : \mathbb{R}^k \to \mathbb{R}^k$ which maps
the $j$th ball into $e_j$, i.e., $f(\mathcal{B}_j) = e_j$, for $j = 1, \dots, k$. This ends the proof.
















Figure 18.11: a. Contraction of a cluster toward a point. b. Contraction of
two clusters toward two distinct points.

18.7 Contraction Toward a Center 

Separation of clusters often involves transformations that pull apart clusters
toward certain points, which are then considered as labels.

We shall start with the simplest case of a single cluster, which is pulled
toward a point, called center. Let $X_j$ be the position vectors of the points
in a cluster $\mathcal{G}$ and $C$ be a given center, with the position vector $c$. The
transformation given by

$$Y_j = \lambda X_j + (1 - \lambda)c,$$

with $\lambda \in (0,1)$, contracts the cluster $\mathcal{G}$ into a cluster situated in a proximity
of the point $C$, see Fig. 18.11 a. The smaller the value of $\lambda$, the closer the
image cluster is to the point $C$.

Consider now two clusters, $\mathcal{G}_1$ and $\mathcal{G}_2$, and two distinct centers, $C_1$ and
$C_2$. Let $X_j^r$ be the position vectors of the elements of the cluster $\mathcal{G}_r$, with
$r \in \{1,2\}$, and $c_i$ the position vectors of the centers $C_i$. The transformations

$$Y_j^1 = \lambda_1 X_j^1 + (1 - \lambda_1)c_1,$$

$$Y_j^2 = \lambda_2 X_j^2 + (1 - \lambda_2)c_2,$$

push the first cluster into the direction of the center $C_1$, while the second
cluster toward $C_2$, see Fig. 18.11 b. The new clusters are more separate than
the former.
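The contraction formula is easy to try out on a finite cluster. The sketch below applies $Y = \lambda X + (1 - \lambda)c$ to each point; the function name `contract` and the sample points are illustrative assumptions.

```python
def contract(points, center, lam):
    """Apply Y = lam * X + (1 - lam) * c to every point of a cluster."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    return [tuple(lam * x + (1.0 - lam) * c for x, c in zip(p, center))
            for p in points]


# Hypothetical cluster contracted halfway toward the origin.
cluster = [(2.0, 0.0), (0.0, 4.0)]
print(contract(cluster, (0.0, 0.0), 0.5))
```

Each image point lies on the segment joining the original point to the center, and its distance to the center is scaled by the factor $\lambda$.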

The previous equations are written only for the cluster points. The question
is whether we can extend them to a global function defined on the entire
space in which the clusters live. This will be addressed in the next section.




18.8 Learning Decision Maps 

Classification of clusters can be achieved by learning certain decision maps. 
This section deals with this topic. 

18.8.1 Linear Decision Maps 

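Consider two clusters in $\mathbb{R}^2$

$$\mathcal{G} = \{(x_1, y_1), \dots, (x_N, y_N)\}, \qquad \tilde{\mathcal{G}} = \{(\tilde{x}_1, \tilde{y}_1), \dots, (\tilde{x}_M, \tilde{y}_M)\},$$

and assume that $\mathrm{hull}(\mathcal{G}) \cap \mathrm{hull}(\tilde{\mathcal{G}}) = \emptyset$. By Exercise 18.11.9, the clusters
$\mathcal{G}$ and $\tilde{\mathcal{G}}$ are linearly separable. We shall investigate the cases of one and
two-dimensional labels.

One-dimensional labels We associate two labels $\alpha$ and $\tilde{\alpha}$ with the clusters
$\mathcal{G}$ and $\tilde{\mathcal{G}}$, respectively. The labels are two distinct real numbers. Since the
clusters $\mathcal{G}$ and $\tilde{\mathcal{G}}$ are linearly separable, it makes sense to look for a linear
function $f : \mathbb{R}^2 \to \mathbb{R}$ that maps the cluster $\mathcal{G}$ in a neighborhood of $\alpha$ and the
cluster $\tilde{\mathcal{G}}$ in a neighborhood of $\tilde{\alpha}$, and also hope that the midpoint $(\alpha + \tilde{\alpha})/2$
separates the sets $f(\mathcal{G})$ and $f(\tilde{\mathcal{G}})$.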

We shall look for a function $f$ to be the input-output function of a feedforward
neural network with one-dimensional output (one neuron in the output
layer) and no hidden layers, i.e., we assume it is given by

$$f_{w,b}(x, y) = w_1 x + w_2 y + b.$$

The real parameters $w_i$, $b$ have to be chosen such that the image sets $f(\mathcal{G})$ and
$f(\tilde{\mathcal{G}})$ are tightly localized about $\alpha$ and $\tilde{\alpha}$, respectively. This can be achieved
by taking

$$(w, b) = \arg\min F(w, b),$$

where $F$ is the following quadratic cost function that measures the distance
between labels and images:

$$F(w, b) = \frac{1}{2} \sum_{i=1}^{N} (w_1 x_i + w_2 y_i + b - \alpha)^2 + \frac{1}{2} \sum_{j=1}^{M} (w_1 \tilde{x}_j + w_2 \tilde{y}_j + b - \tilde{\alpha})^2.$$

The minimum is realized for the solution of the equation $\nabla F(w, b) = 0$.
In the following we shall compute the gradient $\nabla F = (\partial_{w_1} F, \partial_{w_2} F, \partial_b F)$. A
straightforward differentiation using the chain rule yields




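$$\begin{aligned}
\partial_{w_1} F &= \sum_{i=1}^{N} x_i (w_1 x_i + w_2 y_i + b - \alpha) + \sum_{j=1}^{M} \tilde{x}_j (w_1 \tilde{x}_j + w_2 \tilde{y}_j + b - \tilde{\alpha}) \\
&= w_1 \Big( \sum_{i=1}^{N} x_i^2 + \sum_{j=1}^{M} \tilde{x}_j^2 \Big) + w_2 \Big( \sum_{i=1}^{N} x_i y_i + \sum_{j=1}^{M} \tilde{x}_j \tilde{y}_j \Big) \\
&\quad + b \Big( \sum_{i=1}^{N} x_i + \sum_{j=1}^{M} \tilde{x}_j \Big) - \alpha \sum_{i=1}^{N} x_i - \tilde{\alpha} \sum_{j=1}^{M} \tilde{x}_j.
\end{aligned}$$

Similarly, we have

$$\begin{aligned}
\partial_{w_2} F &= w_1 \Big( \sum_{i=1}^{N} x_i y_i + \sum_{j=1}^{M} \tilde{x}_j \tilde{y}_j \Big) + w_2 \Big( \sum_{i=1}^{N} y_i^2 + \sum_{j=1}^{M} \tilde{y}_j^2 \Big) \\
&\quad + b \Big( \sum_{i=1}^{N} y_i + \sum_{j=1}^{M} \tilde{y}_j \Big) - \alpha \sum_{i=1}^{N} y_i - \tilde{\alpha} \sum_{j=1}^{M} \tilde{y}_j, \\
\partial_b F &= w_1 \Big( \sum_{i=1}^{N} x_i + \sum_{j=1}^{M} \tilde{x}_j \Big) + w_2 \Big( \sum_{i=1}^{N} y_i + \sum_{j=1}^{M} \tilde{y}_j \Big) + (N + M) b - \alpha N - \tilde{\alpha} M.
\end{aligned}$$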


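Consider the following $3 \times 3$ matrices:

$$A = \begin{pmatrix} \sum x_i^2 & \sum x_i y_i & \sum x_i \\ \sum x_i y_i & \sum y_i^2 & \sum y_i \\ \sum x_i & \sum y_i & N \end{pmatrix}, \qquad
\tilde{A} = \begin{pmatrix} \sum \tilde{x}_j^2 & \sum \tilde{x}_j \tilde{y}_j & \sum \tilde{x}_j \\ \sum \tilde{x}_j \tilde{y}_j & \sum \tilde{y}_j^2 & \sum \tilde{y}_j \\ \sum \tilde{x}_j & \sum \tilde{y}_j & M \end{pmatrix},$$

which contain information about the clusters $\mathcal{G}$ and $\tilde{\mathcal{G}}$, respectively, such as
size, first and second moments for the $x$ and $y$ variables, as well as their
correlation. Consider two more vectors

$$\beta = \alpha \begin{pmatrix} \sum x_i \\ \sum y_i \\ N \end{pmatrix}, \qquad
\tilde{\beta} = \tilde{\alpha} \begin{pmatrix} \sum \tilde{x}_j \\ \sum \tilde{y}_j \\ M \end{pmatrix},$$

which depend on the labels and the first moments. Then the vector equation
$\nabla F = 0$ can be written as the linear matrix equation

$$(A + \tilde{A}) X = \beta + \tilde{\beta},$$

where $X^T = (w_1, w_2, b)$. The solution is given by

$$X = (A + \tilde{A})^{-1} (\beta + \tilde{\beta}) = A^{-1} (I + \tilde{A} A^{-1})^{-1} (\beta + \tilde{\beta}),$$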








Figure 18.12: Neural network with no hidden layers and two outputs. 


where we used the inverse of the sum formula (G.1.5) from Appendix G.
The existence of the solution holds under the condition $\|\tilde{A} A^{-1}\| < 1$, which
means that the eigenvalues of $\tilde{A}$ are, respectively, smaller than the eigenvalues
of $A$. The geometrical significance of the previous condition is roughly the
following: the cluster which is the longer is also the wider. For all practical
purposes, the inverse matrix $(A + \tilde{A})^{-1}$ can be computed as indicated in
Appendix G.

It is worth noting that this method can be applied to more than two
clusters, as long as their convex hulls are mutually disjoint.

The labels $\alpha$ and $\tilde{\alpha}$ considered above were real numbers; for simplicity,
one may consider $\alpha = 1$ and $\tilde{\alpha} = 0$. However, this is not the only
good possibility.
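The closed-form fit with one-dimensional labels can be sketched in a few lines. The code below builds the moment matrices $A$, $\tilde{A}$ and the vectors $\beta$, $\tilde{\beta}$, then solves $(A + \tilde{A})X = \beta + \tilde{\beta}$ directly by Gaussian elimination rather than through formula (G.1.5). The function names and the sample clusters are assumptions made for this illustration.

```python
def moment_matrix(cluster):
    """Return the 3x3 moment matrix of a cluster and the vector (sum x, sum y, size)."""
    sx = sum(x for x, _ in cluster)
    sy = sum(y for _, y in cluster)
    sxx = sum(x * x for x, _ in cluster)
    syy = sum(y * y for _, y in cluster)
    sxy = sum(x * y for x, y in cluster)
    n = len(cluster)
    return [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]], [sx, sy, n]


def solve3(m, rhs):
    """Solve a nonsingular 3x3 system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_linear_map(cluster, cluster_t, alpha, alpha_t):
    """Return (w1, w2, b) solving (A + A~) X = beta + beta~."""
    A, m = moment_matrix(cluster)
    At, mt = moment_matrix(cluster_t)
    lhs = [[A[i][j] + At[i][j] for j in range(3)] for i in range(3)]
    rhs = [alpha * m[i] + alpha_t * mt[i] for i in range(3)]
    return solve3(lhs, rhs)


# Hypothetical clusters with labels alpha = 1 and alpha~ = 0.
G = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
Gt = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
w1, w2, b = fit_linear_map(G, Gt, 1.0, 0.0)
print([w1 * x + w2 * y + b for x, y in G + Gt])
```

For this data the images of the first cluster fall above the midpoint $1/2$ and those of the second cluster fall below it, so the midpoint acts as the separation value.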

Two-dimensional labels We shall perform in the following the linear
classification of the clusters $\mathcal{G}$, $\tilde{\mathcal{G}}$ using two-dimensional labels, such as vectors or
distinct points in the plane. Assume the label of the cluster $\mathcal{G}$ is the point $A$
in $\mathbb{R}^2$ having coordinates $(\alpha_1, \alpha_2)$. Similarly, the label of the cluster $\tilde{\mathcal{G}}$ is the
point $\tilde{A}$ with coordinates $(\tilde{\alpha}_1, \tilde{\alpha}_2)$.

As before, Proposition 18.5.2 states that the clusters $\mathcal{G}$ and $\tilde{\mathcal{G}}$ are linearly
separable. In this case, we shall look for a linear function $f : \mathbb{R}^2 \to \mathbb{R}^2$ that
maps the cluster $\mathcal{G}$ in a neighborhood of the point $A$ and the cluster $\tilde{\mathcal{G}}$ in a
neighborhood of $\tilde{A}$, see Fig. 18.13.

The linear function $f = (f^1, f^2)$ is constructed as the input-output function
of a feedforward neural network with a two-dimensional output and no
hidden layers, see Fig. 18.12. This means that $f$ is linear in each component

$$f^1_{w,b}(x, y) = w_{11} x + w_{21} y + b_1$$

$$f^2_{w,b}(x, y) = w_{12} x + w_{22} y + b_2.$$







Figure 18.13: The linearly separable clusters $\mathcal{G}$, $\tilde{\mathcal{G}}$ are mapped into disjoint
neighborhoods of point labels $A$ and $\tilde{A}$.


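The weights, $w_{ij}$, and biases, $b_j$, have to be tuned such that the image
sets $f(\mathcal{G})$ and $f(\tilde{\mathcal{G}})$ are as close as possible to the points $A$ and $\tilde{A}$, respectively.
This can be achieved by taking

$$(w, b) = \arg\min G(w, b),$$

where $G$ is the sum of squared distances to the point labels, given by

$$\begin{aligned}
G(w, b) &= \frac{1}{2} \sum_{i=1}^{N} d\big(f_{w,b}(x_i, y_i), A\big)^2 + \frac{1}{2} \sum_{j=1}^{M} d\big(f_{w,b}(\tilde{x}_j, \tilde{y}_j), \tilde{A}\big)^2 \\
&= \frac{1}{2} \sum_{i=1}^{N} \big(f^1(x_i, y_i) - \alpha_1\big)^2 + \frac{1}{2} \sum_{i=1}^{N} \big(f^2(x_i, y_i) - \alpha_2\big)^2 \\
&\quad + \frac{1}{2} \sum_{j=1}^{M} \big(f^1(\tilde{x}_j, \tilde{y}_j) - \tilde{\alpha}_1\big)^2 + \frac{1}{2} \sum_{j=1}^{M} \big(f^2(\tilde{x}_j, \tilde{y}_j) - \tilde{\alpha}_2\big)^2 \\
&= F_1(w_{11}, w_{21}, b_1) + F_2(w_{12}, w_{22}, b_2),
\end{aligned}$$

with

$$F_1(w_{11}, w_{21}, b_1) = \frac{1}{2} \sum_{i=1}^{N} (w_{11} x_i + w_{21} y_i + b_1 - \alpha_1)^2 + \frac{1}{2} \sum_{j=1}^{M} (w_{11} \tilde{x}_j + w_{21} \tilde{y}_j + b_1 - \tilde{\alpha}_1)^2,$$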










$$F_2(w_{12}, w_{22}, b_2) = \frac{1}{2} \sum_{i=1}^{N} (w_{12} x_i + w_{22} y_i + b_2 - \alpha_2)^2 + \frac{1}{2} \sum_{j=1}^{M} (w_{12} \tilde{x}_j + w_{22} \tilde{y}_j + b_2 - \tilde{\alpha}_2)^2.$$

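The minimum is realized for the solution of the equation $\nabla G(w, b) = 0$, where

$$\nabla G(w, b) = (\nabla_{w_{i1}} F_1, \nabla_{b_1} F_1, \nabla_{w_{i2}} F_2, \nabla_{b_2} F_2) \in \mathbb{R}^6.$$

The system contains 6 equations and 6 unknowns (4 weights and 2 biases).
Given the formulas of $F_1$ and $F_2$, it can be shown, similarly to the case
of one-dimensional labels, that the system is equivalent to the following two
linear systems:

$$(A + \tilde{A}) X_1 = \beta_1 + \tilde{\beta}_1$$

$$(A + \tilde{A}) X_2 = \beta_2 + \tilde{\beta}_2,$$

where $A$, $\tilde{A}$ are defined as in the previous case, $X_j^T = (w_{1j}, w_{2j}, b_j)$, and

$$\beta_i = \alpha_i \begin{pmatrix} \sum_k x_k \\ \sum_k y_k \\ N \end{pmatrix}, \qquad
\tilde{\beta}_i = \tilde{\alpha}_i \begin{pmatrix} \sum_k \tilde{x}_k \\ \sum_k \tilde{y}_k \\ M \end{pmatrix}.$$

Using formula (G.1.5) from Appendix G, the solution is given by

$$X_1 = A^{-1} (I + \tilde{A} A^{-1})^{-1} (\beta_1 + \tilde{\beta}_1),$$

$$X_2 = A^{-1} (I + \tilde{A} A^{-1})^{-1} (\beta_2 + \tilde{\beta}_2),$$

provided the condition $\|\tilde{A} A^{-1}\| < 1$ holds.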


The use of softmax Most classification problems use an additional layer
that implements a softmax function, which was introduced in section 2.1.
We shall do this in the case of the previous example, see Fig. 18.14. For
the sake of simplicity, we may choose the labels to be one-hot vectors, i.e.,
$(\alpha_1, \alpha_2) = (0, 1) = e_1$ and $(\tilde{\alpha}_1, \tilde{\alpha}_2) = (1, 0) = e_2$.









Figure 18.14: Neural network with an extra layer implementing the softmax 
function. 


The outcome of the new network is $z = (z_1, z_2)$, with

$$z_1 = \frac{e^{f^1(x,y)}}{e^{f^1(x,y)} + e^{f^2(x,y)}}, \qquad z_2 = \frac{e^{f^2(x,y)}}{e^{f^1(x,y)} + e^{f^2(x,y)}}.$$

Since $z_i > 0$ and $z_1 + z_2 = 1$, it follows that $z$ belongs to the line segment
joining the points $A = (0, 1)$ and $\tilde{A} = (1, 0)$. Then the clusters $\mathcal{G}$ and $\tilde{\mathcal{G}}$ are
mapped into the line segment $A\tilde{A}$, with the image of $\mathcal{G}$ closer to the label
point $A$ and the image of $\tilde{\mathcal{G}}$ closer to $\tilde{A}$, see Fig. 18.15.

Assume the separation point of the cluster images is the middle of the
segment, $(1/2, 1/2)$. Now, let $(x, y)$ be a given point in the plane, and we
have to decide to which cluster the point $(x, y)$ belongs. We run $(x, y)$
through the neural network and if the outcome belongs to the upper part
of the segment $A\tilde{A}$, then the point belongs to the cluster $\mathcal{G}$. Otherwise, it
belongs to $\tilde{\mathcal{G}}$. This test can be done using either the horizontal or the vertical
coordinate. For instance, using the coordinate $z_1$, we have

if $z_1 < 1/2$, then $(x, y)$ belongs to $\mathcal{G}$;

if $z_1 > 1/2$, then $(x, y)$ belongs to $\tilde{\mathcal{G}}$.
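The decision rule based on $z_1$ can be sketched as follows. The inputs `f1` and `f2` stand for the two outputs of the linear layer before softmax; the function names and the string labels `"G"` and `"G_tilde"` are assumptions made for this illustration.

```python
import math


def softmax2(f1, f2):
    """Softmax of two scores, shifted by the maximum for numerical stability."""
    m = max(f1, f2)
    e1, e2 = math.exp(f1 - m), math.exp(f2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)


def classify(f1, f2):
    """Apply the rule: z1 < 1/2 -> cluster G, z1 > 1/2 -> cluster G~.

    The text leaves the tie z1 = 1/2 undecided; here it is assigned
    to G~ as an arbitrary convention.
    """
    z1, _ = softmax2(f1, f2)
    return "G" if z1 < 0.5 else "G_tilde"


print(softmax2(0.0, 2.0), classify(0.0, 2.0))
print(softmax2(2.0, 0.0), classify(2.0, 0.0))
```

Since $z_1 < 1/2$ exactly when $f^1 < f^2$, the same decision could be read off the linear outputs directly; softmax only rescales them onto the segment $A\tilde{A}$.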

The use of softmax is also successful for classification problems involving
more than 2 clusters. For instance, in the case of 3 clusters, $\mathcal{G}_1$, $\mathcal{G}_2$, and
$\mathcal{G}_3$, we associate the labels $A_1 = (1, 0, 0)$, $A_2 = (0, 1, 0)$, and $A_3 = (0, 0, 1)$
in $\mathbb{R}^3$, which form an equilateral triangle, see Fig. 18.16. In this case the
separation point is replaced by a system of separation curves obtained by
taking perpendiculars from the triangle center onto its sides, see Fig. 18.17.
In order to decide the decision region a point belongs to, it suffices to evaluate
the distance to the vertices. For instance, if $\mathrm{dist}(P, A_1) < 1/2$, then the point
$P$ belongs to the image of cluster $\mathcal{G}_1$.




















Figure 18.15: The linearly separable clusters $\mathcal{G}$ and $\tilde{\mathcal{G}}$ are mapped into the
terminal regions of a line segment. In the interior of the segment there is a
separation point.



Figure 18.16: The linearly separable clusters $\mathcal{G}_1$, $\mathcal{G}_2$ and $\mathcal{G}_3$ are mapped into
the vertex regions of a triangle.
















18.8.2 Nonlinear Decision Maps 

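Consider two clusters in $\mathbb{R}^2$

$$\mathcal{G} = \{(x_1, y_1), \dots, (x_N, y_N)\}, \qquad \tilde{\mathcal{G}} = \{(\tilde{x}_1, \tilde{y}_1), \dots, (\tilde{x}_M, \tilde{y}_M)\},$$

and assume that $\mathrm{hull}(\mathcal{G}) \cap \mathrm{hull}(\tilde{\mathcal{G}}) \neq \emptyset$.

We associate two real numbers, $\alpha$ and $\tilde{\alpha}$, as labels with the clusters $\mathcal{G}$ and
$\tilde{\mathcal{G}}$, respectively. By Proposition 18.5.2 the clusters $\mathcal{G}$ and $\tilde{\mathcal{G}}$ are not linearly
separable. Therefore, it makes sense to look for a nonlinear function $f : \mathbb{R}^2 \to
\mathbb{R}$ that maps the cluster $\mathcal{G}$ into a neighborhood of $\alpha$ and the cluster $\tilde{\mathcal{G}}$ into a
neighborhood of $\tilde{\alpha}$.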

The sigmoid neuron The easiest case of a nonlinear function is produced by
a sigmoid neuron, with the input-output function

$$f_{w,b}(x, y) = \sigma(w_1 x + w_2 y + b),$$

where $\sigma(x)$ is the logistic function. We shall assume in this case that the
labels are given by $\alpha = 0$ and $\tilde{\alpha} = 1$. Thus, the cluster $\mathcal{G}$ will be mapped
toward 0, while the cluster $\tilde{\mathcal{G}}$ toward 1. In order to make this map as localized
as possible about the label values, we need to minimize the cost function


$$F(w, b) = \frac{1}{2} \sum_{i=1}^{N} \sigma(w_1 x_i + w_2 y_i + b)^2 + \frac{1}{2} \sum_{j=1}^{M} \big[\sigma(w_1 \tilde{x}_j + w_2 \tilde{y}_j + b) - 1\big]^2.$$


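Using the differentiation property of the logistic sigmoid, $\sigma' = \sigma(1 - \sigma)$, we
obtain

$$\begin{aligned}
\partial_{w_1} F &= \sum_{i=1}^{N} x_i\, \sigma^2 (1 - \sigma) \Big|_{w_1 x_i + w_2 y_i + b} - \sum_{j=1}^{M} \tilde{x}_j\, \sigma (1 - \sigma)^2 \Big|_{w_1 \tilde{x}_j + w_2 \tilde{y}_j + b} \\
\partial_{w_2} F &= \sum_{i=1}^{N} y_i\, \sigma^2 (1 - \sigma) \Big|_{w_1 x_i + w_2 y_i + b} - \sum_{j=1}^{M} \tilde{y}_j\, \sigma (1 - \sigma)^2 \Big|_{w_1 \tilde{x}_j + w_2 \tilde{y}_j + b} \\
\partial_b F &= \sum_{i=1}^{N} \sigma^2 (1 - \sigma) \Big|_{w_1 x_i + w_2 y_i + b} - \sum_{j=1}^{M} \sigma (1 - \sigma)^2 \Big|_{w_1 \tilde{x}_j + w_2 \tilde{y}_j + b}.
\end{aligned}$$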

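Due to nonlinearity, there isn't a closed-form solution for the minimum. The
way to find the minimum is by applying the gradient descent method. The
approximating sequence is given iteratively by

$$\begin{aligned}
w_1^{(n+1)} &= w_1^{(n)} - \eta\, \partial_{w_1} F(w_1^{(n)}, w_2^{(n)}, b^{(n)}) \\
w_2^{(n+1)} &= w_2^{(n)} - \eta\, \partial_{w_2} F(w_1^{(n)}, w_2^{(n)}, b^{(n)}) \\
b^{(n+1)} &= b^{(n)} - \eta\, \partial_b F(w_1^{(n)}, w_2^{(n)}, b^{(n)}),
\end{aligned}$$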








Figure 18.17: The dashed lines separate the triangle into three decision 
regions. 

with the learning rate $\eta > 0$ and initial values $w_1^{(0)} = w_2^{(0)} = b^{(0)} = 0$. It
is worth noting that this is a variant of the logistic regression method and
works only for the classification of two clusters.
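The gradient descent recursion for the sigmoid neuron can be sketched as below. The gradient follows the three partial derivatives of the quadratic cost with labels 0 and 1; the learning rate, the number of steps, and the sample clusters are assumptions chosen only for illustration.

```python
import math


def sigma(u):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-u))


def gradient(w1, w2, b, cluster, cluster_t):
    """Gradient of the quadratic cost: cluster has label 0, cluster_t has label 1."""
    g1 = g2 = gb = 0.0
    for x, y in cluster:
        s = sigma(w1 * x + w2 * y + b)
        c = s * s * (1.0 - s)            # sigma^2 (1 - sigma)
        g1, g2, gb = g1 + x * c, g2 + y * c, gb + c
    for x, y in cluster_t:
        s = sigma(w1 * x + w2 * y + b)
        c = s * (1.0 - s) ** 2           # sigma (1 - sigma)^2
        g1, g2, gb = g1 - x * c, g2 - y * c, gb - c
    return g1, g2, gb


def train(cluster, cluster_t, eta=0.5, steps=2000):
    """Gradient descent started from w1 = w2 = b = 0."""
    w1 = w2 = b = 0.0
    for _ in range(steps):
        g1, g2, gb = gradient(w1, w2, b, cluster, cluster_t)
        w1, w2, b = w1 - eta * g1, w2 - eta * g2, b - eta * gb
    return w1, w2, b


# Hypothetical clusters; the first is sent toward 0, the second toward 1.
G = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
Gt = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
w1, w2, b = train(G, Gt)
print([round(sigma(w1 * x + w2 * y + b), 3) for x, y in G + Gt])
```

Because the cost is not convex in the parameters, the iteration is only guaranteed to approach a stationary point; for well-separated data such as this sample it drives the outputs toward the two labels.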

One-hidden layer network We may assume that the nonlinear function
$f$ is the input-output function of a feedforward neural network with three
layers, having a one-dimensional output. The neurons in the hidden layer
are assumed to have nonlinear activation functions. Then the input-output
map is

$$f_{w,\lambda,b}(x, y) = \sum_{k=1}^{K} \lambda_k\, \sigma(w_{1k} x + w_{2k} y + b_k),$$

where $K$ is the number of hidden neurons, $w_{ij}$ are the weights from the input
to the hidden layer, $\lambda_k$ are the weights from the hidden layer to the output,
and $b_k$ are the biases of hidden neurons, see Fig. 18.18. The function $f_{w,\lambda,b}$
depends on $4K$ parameters ($2K$ weights $w_{ij}$, $K$ biases $b_k$, and $K$ weights
$\lambda_k$). The parameters have to be tuned such that the following cost function
is minimized:


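$$G(w, b, \lambda) = \frac{1}{2} \sum_{i=1}^{N} \big[f_{w,\lambda,b}(x_i, y_i) - \alpha\big]^2 + \frac{1}{2} \sum_{j=1}^{M} \big[f_{w,\lambda,b}(\tilde{x}_j, \tilde{y}_j) - \tilde{\alpha}\big]^2, \qquad (18.8.2)$$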


where $\alpha$, $\tilde{\alpha}$ are the labels associated with the clusters $\mathcal{G}$, $\tilde{\mathcal{G}}$, respectively.
Since there is no closed-form solution for the minimum of the previous cost
function, a gradient descent type algorithm is needed, see Exercise 18.11.10.
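The input-output map of the one-hidden-layer network and the cost (18.8.2) can be evaluated with a short sketch. The factor $1/2$ in the cost, the function names, and the sample parameters are assumptions of this illustration; the gradient itself is left to Exercise 18.11.10.

```python
import math


def sigma(u):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-u))


def network(x, y, w1, w2, b, lam):
    """f(x, y) = sum_k lam[k] * sigma(w1[k] * x + w2[k] * y + b[k])."""
    return sum(l * sigma(a * x + c * y + d)
               for a, c, d, l in zip(w1, w2, b, lam))


def cost(cluster, cluster_t, alpha, alpha_t, w1, w2, b, lam):
    """Quadratic cost with labels alpha, alpha_t (assumed factor 1/2)."""
    g = sum((network(x, y, w1, w2, b, lam) - alpha) ** 2 for x, y in cluster)
    g += sum((network(x, y, w1, w2, b, lam) - alpha_t) ** 2 for x, y in cluster_t)
    return 0.5 * g


# Hypothetical network with K = 2 hidden neurons.
w1, w2, b, lam = [1.0, -1.0], [0.5, 0.5], [0.0, 0.1], [0.7, 0.3]
G = [(0.0, 0.0), (1.0, 0.0)]
Gt = [(3.0, 3.0), (4.0, 3.0)]
print(cost(G, Gt, 0.0, 1.0, w1, w2, b, lam))
```

A gradient descent routine would repeatedly evaluate this cost (or its gradient) while updating the $4K$ parameters stored in the four lists.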

In the case of multiple clusters we may use multiple labels and a similar
method as before. However, using vector labels, such as one-hot vectors,
would provide a more elegant variant for tackling the classification problem.









Figure 18.18: Neural network with one-hidden layer and one output. 


In this case the label space is $\mathbb{R}^p$, where $p$ is the number of clusters that need
to be classified, see Exercise 18.11.11.

18.9 Decision Boundaries 

Consider two clusters, $\mathcal{G}_1$ and $\mathcal{G}_2$, in $\mathbb{R}^2$. A decision boundary is a continuous
curve in $\mathbb{R}^2$ that separates the clusters, see Fig. 18.19. This means that for
any $g_1 \in \mathcal{G}_1$ and $g_2 \in \mathcal{G}_2$, any continuous curve from $g_1$ to $g_2$ intersects the
decision boundary curve.

The next result states that the existence of a decision boundary is locally
equivalent to the linear separation of the clusters.

Theorem 18.9.1 (Rectification Theorem) Consider two clusters $\mathcal{G}_1$ and
$\mathcal{G}_2$ in $\mathbb{R}^2$ and a smooth decision boundary between them. Then there is a
smooth nonlinear function $F$ defined on a neighborhood of the decision curve
with values in $\mathbb{R}^2$, such that $F$ maps the decision curve into the $x$-axis, the
cluster $\mathcal{G}_1$ into the upper half-plane and the cluster $\mathcal{G}_2$ into the lower
half-plane.

Proof: Denote by $c(s)$, $0 \le s \le T$, the decision curve and consider its
evolution through the normal flow, $\Phi_t(c(s))$, with $-\epsilon < t < \epsilon$. This is
$\Phi_t(c(s)) = c(s) + tN(s)$, where $N$ is the unit normal vector to $c(s)$, see
Fig. 18.20. For small enough $\epsilon$, the flow does not have any singularities. The
cluster $\mathcal{G}_1$ corresponds to values $t > 0$ and the cluster $\mathcal{G}_2$ corresponds to $t < 0$.
The set $\mathcal{U} = \{\Phi_t(c(s));\ |t| < \epsilon,\ 0 \le s \le T\}$ defines a neighborhood of the
decision curve. Then define the mapping $F : \mathcal{U} \to \mathbb{R}^2$ by $F(\Phi_t(c(s))) = (s, t)$. Since








Figure 18.19: Decision boundary curve between two clusters in $\mathbb{R}^2$.



Figure 18.20: The function $F$ maps the decision curve into the $x$-axis.


$\Phi_0(c(s)) = c(s)$, then $F$ maps $c(s)$ into $(s, 0)$, and sends the cluster $\mathcal{G}_1 \cap \mathcal{U}$
into the upper half-plane and the cluster $\mathcal{G}_2 \cap \mathcal{U}$ into the lower half-plane.

■
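For a concrete decision curve the rectifying map can be written explicitly. The sketch below assumes the decision curve is the unit circle $c(s) = (\cos s, \sin s)$ with outward unit normal $N(s) = (\cos s, \sin s)$, so that $\Phi_t(c(s)) = (1 + t)(\cos s, \sin s)$ and $F(x, y) = (s, t)$ with $t = \|(x, y)\| - 1$. The circle and the name `rectify` are assumptions of this example; since the circle is a closed curve, $s$ is taken in $[0, 2\pi)$.

```python
import math


def rectify(x, y):
    """Map a point near the unit circle to (s, t): angle along the curve and
    signed normal distance (t > 0 outside the circle, t < 0 inside)."""
    r = math.hypot(x, y)
    s = math.atan2(y, x) % (2.0 * math.pi)
    return s, r - 1.0


print(rectify(1.0, 0.0))   # a point on the decision curve
print(rectify(0.0, 1.2))   # a point outside the circle
print(rectify(0.8, 0.0))   # a point inside the circle
```

Points on the circle are sent to the $s$-axis, points of a cluster lying just outside go to the upper half-plane, and points just inside go to the lower half-plane, as in the statement of the theorem.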

This result can be extended to higher dimensions. If the clusters $\mathcal{G}_1$ and
$\mathcal{G}_2$ are in $\mathbb{R}^n$, a decision boundary is an $(n-1)$-hypersurface, $\mathcal{H}$, in $\mathbb{R}^n$, which
separates the clusters, i.e., for any $g_1 \in \mathcal{G}_1$ and $g_2 \in \mathcal{G}_2$, any continuous curve
from $g_1$ to $g_2$ intersects the hypersurface $\mathcal{H}$.

Similarly to the previous case, we can construct a system of coordinates
in a neighborhood of $\mathcal{H}$. Any point $P$ in this neighborhood can be
projected onto $\mathcal{H}$ in a point $P'$. The coordinates associated with $P$ are
$(x_1, \dots, x_{n-1}, x_n)$, where $(x_1, \dots, x_{n-1})$ are the coordinates of $P'$ on $\mathcal{H}$ and $x_n$
is the length of the projection of $P$ onto $\mathcal{H}$, considered with the sign given by
the orientation of the hypersurface. Thus, the function $F(P) = (x_1, \dots, x_n)$
maps $\mathcal{H}$ into the hyperplane $\{x_n = 0\}$.

























The case involving more than 2 clusters is more complex, as in this case
the decision boundary has a richer structure. For instance, if the clusters are
situated in a plane, we have the following definition. Consider $p$ clusters of
points $\mathcal{G}_1, \mathcal{G}_2, \dots, \mathcal{G}_p$ in $\mathbb{R}^2$. In this case, a decision boundary is a connected
system of continuous curves in $\mathbb{R}^2$ that separates the clusters. That is, for any
$g_i \in \mathcal{G}_i$ and $g_j \in \mathcal{G}_j$, with $i \neq j$, any continuous curve from $g_i$ to $g_j$ intersects
the decision boundary.

18.10 Summary 

During the classification process of clusters, labels have to be accurately
assigned to each cluster. There are several types of labels that can be assigned:
numbers, one-hot vectors, points, etc. The mapping assigning labels to clusters
is a decision map which can be learned by neural networks.

Two clusters are linearly separable if their convex hulls are disjoint. A
single perceptron can learn the associated decision function in this case.

If the convex hulls of two clusters are not disjoint, the clusters are not
linearly separable, and in this case a neural network with nonlinear activation
function is used to learn the associated decision function.

The use of the softmax activation function in the last layer of a neural
network maps the $n$ clusters in the corners of an $(n-1)$-dimensional polytope.

The Rectification Theorem states that the existence of a decision boundary
between the clusters is locally equivalent to the linear separation of the
clusters.

18.11 Exercises 

Exercise 18.11.1 Let $S = (I \times I) \cap (\mathbb{Q} \times \mathbb{Q})$ be the lattice of rational
numbers in $[0,1] \times [0,1]$.

(a) Show that $S$ is a transitive relation;

(b) Is $S$ an equivalence relation?

Exercise 18.11.2 Show that $k$ points are in a general position if and only if
there is a unique $(k-1)$-hyperplane containing the points.

Exercise 18.11.3 Let $P_1, \dots, P_k$ be $k$ points in $\mathbb{R}^k$ in a general position, and
let $\mathcal{H}$ be the unique $(k-1)$-hyperplane containing them. Show that the set
$\{\overrightarrow{P_1P_2}, \dots, \overrightarrow{P_1P_k}\}$ forms a system of independent vectors in $\mathcal{H}$.

Exercise 18.11.4 Consider two distinct real numbers $x_1, x_2 \in \mathbb{R}$. Show that
there is a unique affine function $f(x) = ax + b$ such that $f(x_1) = 1$ and
$f(x_2) = 2$.





Exercise 18.11.5 Let $\mathcal{G}$ be a cluster in $\mathbb{R}^n$.

(a) Show that $\mathrm{hull}(\mathcal{G})$ is a convex set;

(b) Prove that $\mathrm{hull}(\mathcal{G})$ is the smallest convex set in $\mathbb{R}^n$ which contains $\mathcal{G}$;

(c) Verify that

$$\mathrm{hull}(\mathcal{G}) = \bigcap_{K} \{K;\ \mathcal{G} \subset K \subset \mathbb{R}^n \text{ convex}\}.$$

Exercise 18.11.6 Let $\Phi : \mathbb{R}^2 \to \mathbb{R}^2$ be a nondegenerate affine function, i.e.,
$\Phi(x) = Wx + b$, with $\det W \neq 0$. Consider two linearly separable clusters in
$\mathbb{R}^2$, $\mathcal{G}_1$ and $\mathcal{G}_2$. Show that $\Phi(\mathcal{G}_1)$ and $\Phi(\mathcal{G}_2)$ are linearly separable. In other
words, affine functions preserve linear separability.

Exercise 18.11.7 Let $\mathcal{G}$ be a finite cluster of points and denote by $G$ its
center of mass. Show that $G \in \mathrm{hull}(\mathcal{G})$.

Exercise 18.11.8 Show that a family of clusters $\mathfrak{G} = \{\mathcal{G}_1, \dots, \mathcal{G}_k\}$ in $\mathbb{R}^n$ is
convex separable if and only if the family is mutually convex separable, i.e.,
any two clusters $\mathcal{G}_i$ and $\mathcal{G}_j$ are convex separable for any $i \neq j$.

Exercise 18.11.9 (a) If $A$ and $B$ are two convex sets such that $A \cap B = \emptyset$,
then $A$ and $B$ are linearly separable;

(b) If $\mathcal{G}_1$ and $\mathcal{G}_2$ are two clusters such that $\mathrm{hull}(\mathcal{G}_1) \cap \mathrm{hull}(\mathcal{G}_2) = \emptyset$, then $\mathcal{G}_1$
and $\mathcal{G}_2$ are linearly separable.

Exercise 18.11.10 (a) Find the gradient $\nabla_{w,b,\lambda} G$, where $G$ is given by (18.8.2).

(b) Write the gradient descent recursion for the approximation sequence of
the minimum.

Exercise 18.11.11 Consider $p$ clusters, $\mathcal{G}_1, \dots, \mathcal{G}_p$, of points in $\mathbb{R}^2$. Write the
cost function that associates the vector $e_j$ as a label for the cluster $\mathcal{G}_j$, for all
$j = 1, \dots, p$.

Exercise 18.11.12 Prove that the entropy associated to a partition given
by (18.2.1) is maximum if and only if all sets $A_j$ have equal measure.




Chapter 19 

Generative Models 


So far, neural networks have been used for two main types of problems:
regression and classification. For regression, we used a neural network with
a one-dimensional output and a linear activation function in the output
layer. For classification, it is useful to employ a neural network with a
multidimensional output and a softmax activation function in the output layer.

In this chapter we shall deal with another important application, which
is the construction of generative models. If a regression model can forecast
future patterns and a classification model can recognize typefaces, a
generative model can generate examples which are very much like the training
data.

19.1 The Need of Generative Models 

There are many reasons for considering generative models. Their successful
applications have made them very popular in many areas of industry. The ability
of generative models to generate examples that are very similar to the training
data makes them useful in situations when the available data is insufficient or
even missing. In many real-life situations data augmentation might be costly
and time-consuming. Generative models can achieve this job at a lower cost.
We shall provide a few examples.

1. For safety reasons training a neural network for driving a car, or operating 
a robot, cannot be done in a real-life environment. Therefore, a generative 
model is used to provide a simulated environment where the car is trained 
until the model is ready to be deployed in real life. 

© Springer Nature Switzerland AG 2020

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_19 





2. In the case of setting a business plan, a generative model can be employed
to provide a simulation of possible future business environments. These
possible environments are generated by similarity with past events.

3. Another application is to price financial contracts such as derivatives. 
These are contracts whose price depends on the price of an underlying stock. 
The traditional computational technique is to use the Monte Carlo method. 
This assumes a large number of possible stock price simulations under which 
the financial contract is priced. Then an average of all the simulated contract 
prices is considered as the computed contract price. Generative models can be 
used to simulate stock markets and price financial contracts in these markets, 
providing an idea about the real contract price. 

4. Other examples are the use of generative models in the cartoon and movie
industries, where real-life-like environments need to be generated, such as
buildings, trees, mountains, etc. Generative models can also be used to
generate music, paintings, people's faces, etc. For instance, the iGAN software
was created to generate pictures starting from a rough hand sketch.

We shall discuss next the two types of generative models: the ones that 
provide a density estimation and the ones that produce sample generations. 


19.2 Density Estimation 

The task here is to create a machine that observes many samples from a
given distribution, $p_{data}(x)$, and is able to create later more samples from
that distribution. We distinguish the following two cases: parametric and
nonparametric distributions.

Parametric case If there is a reason to believe that the given distribution,
$p_{data}(x)$, can be estimated by a parametric distribution, $p_{model}(x; \theta)$, then we
just need to adjust $\theta$ such that the log-likelihood of the model distribution
evaluated on the data is maximum

$$\theta^* = \arg\max_{\theta}\, \mathbb{E}_{x \sim p_{data}} \big[ \ln p_{model}(x; \theta) \big].$$

We note that in practice the expectation is computed as an average

$$\mathbb{E}_{x \sim p_{data}} \big[ \ln p_{model}(x; \theta) \big] \approx \frac{1}{n} \sum_{i=1}^{n} \ln p_{model}(x_i; \theta),$$

where $x_i \sim p_{data}$ is a sample extracted from the given distribution.

Another equivalent way to select the parameter $\theta$ is by requiring the
Kullback-Leibler divergence between the data and the model distributions to
be minimum

$$\theta^* = \arg\min_{\theta}\, D_{KL}\big[ p_{data} \,\|\, p_{model} \big].$$

This follows from the computation

$$
\begin{aligned}
\theta^* &= \arg\max_{\theta}\, \mathbb{E}_{x \sim p_{data}} [\ln p_{model}(x; \theta)] = \arg\min_{\theta}\, \mathbb{E}_{x \sim p_{data}} [-\ln p_{model}(x; \theta)] \\
&= \arg\min_{\theta} \Big\{ -\int p_{data}(x) \ln p_{model}(x; \theta)\, dx \Big\} = \arg\min_{\theta}\, S(p_{data}, p_{model}) \\
&= \arg\min_{\theta} \big[ S(p_{data}, p_{model}) - H(p_{data}) \big] = \arg\min_{\theta}\, D_{KL}(p_{data} \,\|\, p_{model}),
\end{aligned}
$$

where we used the independence of the entropy of the given data, $H(p_{data})$,
from $\theta$ and the definitions of the cross-entropy and Kullback-Leibler
divergence.
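As a minimal numerical sketch of the parametric case, assume the model family is a one-dimensional Gaussian with $\theta = (\mu, \sigma)$. The family, the synthetic data, and the helper name `avg_log_likelihood` are assumptions made only for illustration. For this family the average log-likelihood is maximized by the sample mean and the (biased) sample standard deviation, so nearby parameter values should score lower.

```python
import numpy as np

def avg_log_likelihood(x, mu, sigma):
    """Average of ln p_model(x_i; mu, sigma) for a Gaussian model."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5000)   # synthetic samples playing the role of p_data

# Closed-form maximizer of the average log-likelihood for a Gaussian model
mu_hat, sigma_hat = x.mean(), x.std()

# Perturbing the parameters can only lower the average log-likelihood
best = avg_log_likelihood(x, mu_hat, sigma_hat)
perturbed = [avg_log_likelihood(x, mu_hat + dm, sigma_hat + ds)
             for dm in (-0.2, 0.2) for ds in (-0.2, 0.2)]
```

Since $H(p_{data})$ does not depend on $\theta$, the same parameters also minimize the sample estimate of the cross-entropy, and hence of the Kullback-Leibler divergence.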

Nonparametric case The idea, in this case, is the following: We take a
simple distribution (such as a uniform or a Gaussian) and then apply a
nonlinear transformation $G$ to samples from that distribution to obtain samples
from the desired distribution, $p_{data}(x)$.

We shall introduce first some terminology and notation. The space
where $x$ takes values is denoted by $\mathcal{X}$ and represents the space we care about.
The simple distribution, $p(z)$, subject to be transformed by $G$ into $p_{data}(x)$,
is called the code distribution; the space $\mathcal{Z}$ where $z$ is sampled from is called
the latent space. Now, we would like for each random selection $z \in \mathcal{Z}$ to
obtain $x = G(z)$ as a sample from the space $\mathcal{X}$.

It might be useful to make the connection to the random variables
terminology. The latent variables $z$ are instances of a random variable denoted
by $Z$ taking values in the latent space $\mathcal{Z}$. The samples $x$ from $p_{data}(x)$ are
considered as instances of a random variable $X$ on the space $\mathcal{X}$. Under these
notations, we are looking for a transformation $G$ that maps the variable $Z$
into $X$. This can be achieved in a couple of ways.

1. We assume the data distribution $p_{data}$ is one-dimensional and we shall
consider a uniform coding distribution, $p_{code} \sim \mathrm{Unif}[0,1]$. We shall show
that the transformation $G : [0,1] \to \mathcal{X}$, which satisfies $G(Z) = X$, is given by
$G = F_{data}^{-1}$, where $F_{data}(x) = \int_{-\infty}^{x} p_{data}(s)\, ds$ is the cumulative distribution
function of the random variable $X$.

The key point here is that if $Z \sim \mathrm{Unif}[0,1]$, then $P(Z \le z) = z$, for all
$z \in [0,1]$. Now, in order to show that the random variables $X$ and $Y = G(Z)$
have the same distribution, we use the computation

$$F_Y(x) = P(Y \le x) = P(G(Z) \le x) = P(Z \le F_{data}(x)) = F_{data}(x), \quad \forall x \in \mathcal{X}.$$








Figure 19.1: The transformation G between the latent space and data space. 


To conclude, for generating $n$ samples in space $\mathcal{X}$ following distribution
$p_{data}$, we select uniformly $n$ random instances $z_i \in [0,1]$ and consider
$x_i = G(z_i)$, provided $G = F_{data}^{-1}$ is known. Then all the constructed instances
satisfy $x_i \sim p_{data}$.
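The one-dimensional construction can be sketched numerically. The example below assumes, purely for illustration, that $p_{data}$ is the exponential density with rate 2, for which $F_{data}(x) = 1 - e^{-2x}$ and therefore $G(z) = F_{data}^{-1}(z) = -\ln(1 - z)/2$. The choice of distribution and the sample size are not from the text.

```python
import numpy as np

def G(z, lam=2.0):
    """Inverse CDF of the exponential distribution with rate lam."""
    return -np.log(1.0 - z) / lam

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=100_000)   # code samples z_i ~ Unif[0, 1]
x = G(z)                                  # x_i = G(z_i), expected to follow p_data

# Compare the empirical CDF of the generated samples with F_data at a few points
pts = np.array([0.1, 0.5, 1.0])
empirical = np.array([(x <= p).mean() for p in pts])
exact = 1.0 - np.exp(-2.0 * pts)
```

The empirical and exact CDF values should agree up to sampling error, which is the content of the identity $F_Y(x) = F_{data}(x)$.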

Since in practice the distribution is seldom one-dimensional, we shall
consider next a multidimensional case.

2. In this case the coding distribution, $p_{code}$, is multidimensional and not
necessarily uniform. The latent space is denoted by $\mathcal{Z}$ and we consider an
invertible differentiable transform $G : \mathcal{Z} \to \mathcal{X}$, whose inverse and
Jacobian determinant can both be computed, see Fig. 19.1. Then this change
of variables produces a new density on $\mathcal{X}$ given by

$$p_{model}(x) = p_X(x) = \frac{p_{code}(z)}{|\det J_G(z)|}, \qquad z = G^{-1}(x). \tag{19.2.1}$$


Example 19.2.1 As an example, we shall show how we can construct samples
from the multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is a given
vector in $\mathbb{R}^n$ and $\Sigma$ is a symmetric, nondegenerate, and positive-definite
$n$-dimensional matrix. We consider the affine transform $G : \mathbb{R}^n \to \mathbb{R}^n$, given
by $G(z) = \mu + Az$, where $AA^T = \Sigma$. Obviously, $\det J_G(z) = \det A \neq 0$, and
hence $G$ is invertible, with the inverse

$$G^{-1}(x) = A^{-1}(x - \mu).$$

We choose the coding distribution to be the standard normal distribution
with zero mean and identity covariance

$$p_{code}(z) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{\|z\|^2}{2}},$$

namely, $Z \sim \mathcal{N}(0, I_n)$. Using formula (19.2.1) we obtain the following density:

$$
\begin{aligned}
p_X(x) &= \frac{1}{|\det A|\,(2\pi)^{n/2}}\, e^{-\frac{1}{2}\|A^{-1}(x-\mu)\|^2} \\
&= \frac{1}{(\det \Sigma)^{1/2}(2\pi)^{n/2}}\, e^{-\frac{1}{2}\langle A^{-1}(x-\mu),\, A^{-1}(x-\mu)\rangle} \\
&= \frac{1}{\sqrt{(2\pi)^n \det \Sigma}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)},
\end{aligned}
$$

where we used $|\det A| = (\det \Sigma)^{1/2}$ and the algebra relations
$(A^{-1})^T A^{-1} = (A^T)^{-1} A^{-1} = (AA^T)^{-1} = \Sigma^{-1}$.
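The affine construction of Example 19.2.1 can be checked numerically. The sketch below assumes a particular two-dimensional $\mu$ and $\Sigma$ and takes $A$ to be the Cholesky factor of $\Sigma$, which is one possible matrix with $AA^T = \Sigma$; these choices and the helper names are illustrative assumptions only.

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

A = np.linalg.cholesky(Sigma)                 # lower-triangular A with A @ A.T == Sigma

rng = np.random.default_rng(0)
z = rng.standard_normal(size=(200_000, 2))    # rows z_i ~ N(0, I_2)
x = mu + z @ A.T                              # rows x_i = G(z_i) = mu + A z_i

def density_via_change_of_variables(x_point):
    """p_code(G^{-1}(x)) / |det A|, i.e. formula (19.2.1) for the affine map G."""
    z_point = np.linalg.solve(A, x_point - mu)
    p_code = np.exp(-0.5 * z_point @ z_point) / (2 * np.pi) ** (len(mu) / 2)
    return p_code / abs(np.linalg.det(A))

def gaussian_density(x_point):
    """Closed-form N(mu, Sigma) density."""
    d = x_point - mu
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(Sigma))
```

Evaluating both density functions at the same point should give the same number, and the sample mean and covariance of `x` should be close to `mu` and `Sigma`.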

It is worth noting that the method based on formula (19.2.1) has
some downsides. First, the method assumes that the determinant of the
Jacobian and the inverse of the transform $G$ are computable. Another downside
of this method is that the dimension of the latent space is equal to the
dimension of the data space, which involves too many parameters. If the data
space represents a picture, then using this method will involve as many latent
variables as pixels in the picture! This is the reason for constructing a method
which has a latent space of a much smaller dimension than the data space.

The second type of generative model is sample generation, which
generates more samples rather than finding the density function. In this group,
we shall discuss the generative adversarial networks (GANs) and generative
moment matching networks. We shall discuss in the next section the
adversarial aspect first.


19.3 Adversarial Games 

We consider two players who are engaged in a competitive game, fighting over
the same payoff function: one wants it to be high, while the other wants it to
be low. This way, the players become adversaries, being interested in opposite
rewards. Each of them controls certain variables of the payoff function and
tries to adjust them to get the maximum benefit. For certain parameter values
the players might arrive at an equilibrium. This can be understood using the
following example.

We may think of one player as the seller and of the other as the buyer of
a certain product. Both parties have to agree upon the product price. The
seller will try to push the price up, while the buyer would like the price to be
as low as possible. At the end of the negotiation process the seller agrees to
lower the product price enough and the buyer agrees to pay sufficiently more
such that the parties enter a seller-buyer agreement. This way they arrive
at the equilibrium price, which corresponds to the Nash equilibrium point of
the game.

The problem can be formulated as a minimax problem as follows.
Consider a payoff function, $V(x, y)$, which depends on variables $x$ and $y$. The
first player controls the variable $x$ and the second player has control over the
variable $y$. The first player wants the payoff function $V(x, y)$ to be minimized,
while the second one wants $V(x, y)$ to be maximized (which is equivalent to
minimizing $-V(x, y)$). The problem can be formulated as

$$(x^*, y^*) = \arg\max_{y} \min_{x} V(x, y).$$


The equilibrium point can be obtained using a simultaneous gradient descent
method in its continuous-time variant as follows.

Since the first player controls the variable $x$ and intends to minimize
$V(x, y)$, it should adjust $x$ in the direction of the negative gradient of $V$ by
a step $\eta = \Delta t$

$$x(t + \Delta t) = x(t) - \eta \frac{\partial V}{\partial x}.$$

Similarly, since the second player controls the variable $y$ and would like to
maximize $V(x, y)$, it should adjust $y$ in the direction of the positive gradient
of $V$ by a step $\eta = \Delta t$

$$y(t + \Delta t) = y(t) + \eta \frac{\partial V}{\partial y}.$$

Taking $\Delta t \to 0$ we obtain a simultaneous gradient descent with an
infinitesimal learning rate, where the learning process follows a smooth trajectory
$(x(t), y(t))$ that satisfies the following continuous-time differential system:

$$\frac{dx}{dt} = -\frac{\partial V}{\partial x} \tag{19.3.2}$$

$$\frac{dy}{dt} = \frac{\partial V}{\partial y}. \tag{19.3.3}$$

The initial condition is $(x(0), y(0)) = (x_0, y_0)$ and the equilibrium point, if
it exists, is obtained as the limit $(x^*, y^*) = \lim_{t \to \infty} (x(t), y(t))$. Since at
equilibrium we have $\frac{dx^*}{dt} = \frac{dy^*}{dt} = 0$, the equilibrium points satisfy the system

$$\frac{\partial V}{\partial x} = 0, \qquad \frac{\partial V}{\partial y} = 0.$$

It is worth noting that the change of the payoff function $V(x, y)$ along the
learning curve $(x(t), y(t))$ is given by its differential

$$dV = \frac{\partial V}{\partial x}\, dx + \frac{\partial V}{\partial y}\, dy = -\dot{x}\, dx + \dot{y}\, dy,$$

where we used (19.3.2) and (19.3.3).


Example 19.3.1 The revenue obtained by selling $q$ units of a certain product
at the price $p$ per unit is $V(p, q) = pq$. The seller can control the price, $p$, and
is interested in maximizing the revenue $V(p, q)$. The buyer can control the
number of units sold, $q$, and would like to minimize the amount paid, $V(p, q)$.
This becomes a minimax problem and it will be approached using the
simultaneous gradient descent with an infinitesimal learning rate. The learning
process follows a trajectory $(p(t), q(t))$, which satisfies the continuous-time
system (19.3.2)-(19.3.3). Since here $p$ is the maximizing variable and $q$ the
minimizing one, the system becomes

$$\frac{dp}{dt} = q, \qquad \frac{dq}{dt} = -p.$$

Differentiating one more time with respect to $t$, we obtain $\ddot{p} = -p$ and $\ddot{q} = -q$,
which implies that both $p(t)$ and $q(t)$ are linear combinations of $\cos t$ and $\sin t$

$$
\begin{aligned}
p(t) &= A_1 \cos t + B_1 \sin t \\
q(t) &= A_2 \cos t + B_2 \sin t,
\end{aligned}
$$

with $A_i$ and $B_i$ constants. Substituting into the differential system and using
the initial conditions we determine the constants and write the solution as

$$
\begin{aligned}
p(t) &= p_0 \cos t + q_0 \sin t \\
q(t) &= q_0 \cos t - p_0 \sin t.
\end{aligned}
$$

Since this can be represented in matrix form as $(p, q)^T = \mathcal{R}(t)(p_0, q_0)^T$,
where

$$\mathcal{R}(t) = \begin{pmatrix} \cos t & \sin t \\ -\sin t & \cos t \end{pmatrix}$$

denotes the rotation in the plane of angle $t$, it follows that the solution
$(p(t), q(t))$ follows a circle trajectory centered at the origin and having the
radius $r = \sqrt{p_0^2 + q_0^2}$. This solution does not approach any equilibrium point
as long as $r \neq 0$. The only equilibrium point is obtained at the origin,
$(p^*, q^*) = (0, 0)$, and this occurs in the case $p_0 = q_0 = 0$.
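The behavior in Example 19.3.1 can be illustrated with a short simulation. The sketch below compares the continuous-time solution with a discrete simultaneous update using a finite learning rate; the step size, the initial point, and the function names are assumptions chosen for illustration. For the discrete update a direct computation gives $p_{n+1}^2 + q_{n+1}^2 = (1 + \eta^2)(p_n^2 + q_n^2)$, so the iterates are expected to spiral away from the origin instead of staying on the circle.

```python
import numpy as np

def simultaneous_step(p, q, eta):
    """One simultaneous update: the seller ascends V = p*q in p, the buyer descends in q."""
    return p + eta * q, q - eta * p

def exact_solution(p0, q0, t):
    """Continuous-time solution from Example 19.3.1."""
    return p0 * np.cos(t) + q0 * np.sin(t), q0 * np.cos(t) - p0 * np.sin(t)

p0, q0, eta, steps = 1.0, 0.5, 0.1, 100
p, q = p0, q0
radii = [np.hypot(p, q)]
for _ in range(steps):
    p, q = simultaneous_step(p, q, eta)
    radii.append(np.hypot(p, q))

# The continuous-time trajectory keeps its distance to the origin
r_exact = np.hypot(*exact_solution(p0, q0, steps * eta))
```

Neither variant approaches the equilibrium $(0, 0)$: the continuous flow circles around it, and the discrete iteration moves away from it by a factor $\sqrt{1 + \eta^2}$ per step.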










19.4 Generative Adversarial Networks 

Generative Adversarial Networks (GANs) were introduced in 2014 by
Goodfellow et al. [47] and are nowadays considered to be one of the most
powerful generative models.

GANs represent a noncooperative game played by two neural networks:
the discriminator and the generator. These networks act as adversaries,
having opposite rewards, the worst-case input for one network being produced
by the other network. During this competitive game each network forces the
other network to improve.

Generator network The generator is a network whose input is some random
noise selected from a latent space and whose output is an image $x$, which is
supposed to resemble the images in the data space. The generator output
is $x = G(z; \theta^{(g)})$, where $\theta^{(g)}$ are the generator network parameters and $z$
is a latent vector variable in a latent space $\mathcal{Z}$. We shall denote by $Z$
the random variable with instance $z$ taking values in the latent space $\mathcal{Z}$
and by $p_{code}(z)$ its probability density. We denote by $X$ the output random
variable, namely $X = G(Z; \theta^{(g)})$, and denote its density by $p_{model}(x; \theta^{(g)})$.
The generator function $G$ must be differentiable, with the dimension of $\mathcal{Z}$
less than the dimension of $\mathcal{X}$.

Discriminator network The discriminator network works as a classifier.
This means that the discriminator is a network that is given an input
$x$ (for instance, an image) and produces a number $D(x; \theta^{(d)})$ between 0 and
1, where $\theta^{(d)}$ are the discriminator network parameters. This number can
be considered as the probability that the input $x$ is regarded as genuine
training data, with a value of 0 assigned to the case when the discriminator
completely rejects the input $x$ as belonging to the training data. If the input is
an image, the discriminator can be taken to be a convolutional network and
can be trained using gradient descent. The discriminator and generator
network functions can be seen in Fig. 19.2.

The generator's job is to fool the discriminator, making it believe that its
output is real training data; this means the generator wants $D(G(z))$ close
to 1. On the other side, the discriminator's goal is to prove the generator wrong,
namely, it would like to output a value $D(G(z))$ close to 0, for all $z \in \mathcal{Z}$.

The training process functions in cycles. In the beginning the generator
is not that smart, producing some random noise. Over time, the generator
will improve, producing images more and more similar to the images in the
training dataset. At the same time, the discriminator is trained on fake and
real images, using a batch of images produced by the generator and a batch
of images selected from the training data set. As the discriminator becomes
more skillful in its job, the generator will tend to produce images that
resemble more and more the ones in the data space. If the training is successful,
the generator will produce in the end images indistinguishable from
the real training images. Also, in the end the discriminator will not be able
to distinguish whether its input is fake or real, providing a probability
output of 0.5. At this point the discriminator becomes useless and can be
discarded, while keeping only the generator network.

Figure 19.2: The anatomy of a GAN: latent variables are fed into the
generator producing "fake" data. Then we repeatedly sample two minibatches of data,
one from the real data and another from the generated data, and let the
discriminator decide whether each sample is fake or genuine. Using this decision, we can
update the networks' parameters to improve the GAN's functionality.

Example 19.4.1 Consider a GAN which is supposed to generate prime
numbers. We assume that the training data is the set of the first $n$ prime
numbers, $A = \{p_1, p_2, \dots, p_n\}$. If the generator network produces a number $x$,
the discriminator network will test whether this number is prime by
checking whether the division of $x$ by each $p_i$ is exact. The discriminator
produces $D(x) = 0$ if there is a prime number $p_i$ that is a divisor of $x$, and
$D(x) = 1$, otherwise. In the beginning the generator might produce, for
instance, $x = p_1 p_2$ and the discriminator will easily classify it as a nonprime
number. But over time, the generator will learn that it has to produce a
number larger than all the prime numbers provided, which is not a multiple
of them, such as $x = p_1 p_2 \cdots p_n + 1$. In this case the discriminator will produce
$D(x) = 1$, i.e., will classify the number $x$ as prime. The generator can use this
information, producing next time the output $x = p_1 \cdots p_n (p_1 p_2 \cdots p_n + 1) + 1$,
and so forth.
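The discriminator of Example 19.4.1 is simple enough to write down directly. The following sketch assumes the training set consists of the first six primes; this choice and the function name `discriminator` are illustrative only.

```python
from math import prod

def discriminator(x, known_primes):
    """D(x) = 0 if some known prime divides x, and D(x) = 1 otherwise."""
    return 0 if any(x % p == 0 for p in known_primes) else 1

known_primes = [2, 3, 5, 7, 11, 13]

first_try = known_primes[0] * known_primes[1]      # x = p_1 p_2 = 6
second_try = prod(known_primes) + 1                # x = p_1 ... p_n + 1 = 30031
third_try = prod(known_primes) * second_try + 1    # x = p_1 ... p_n (p_1 ... p_n + 1) + 1
```

With this training set the discriminator rejects `first_try` and accepts the other two outputs. Note that acceptance only means that no known prime divides the number: 30031 equals 59 times 509, so the generator has fooled this discriminator rather than produced a genuine prime.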













We note that several discriminator functions D(x) can be used. If we
denote by div(x) the number of divisors of x among the set of known prime
numbers, then we can also define D(x) — e 2 - dlv ( x )^ or and

consider them as discriminator functions. 


Loss function The discriminator tries to maximize the payoff

$$V(G, D) = \mathbb{E}_{x \sim p_{data}} [\ln D(x)] + \mathbb{E}_{x \sim p_{model}} [\ln(1 - D(x))]. \tag{19.4.4}$$

Maximizing the first term ensures a correct classification of true data,
while maximizing the second term ensures the correct classification of
the model generated data. This follows from the fact that the discriminator's
output $D(x)$ tends to be close to 1 for true data $x$, while $1 - D(x)$ tends to
be close to 1 for generated data $x$.


Example 19.4.2 The simplest example of discriminator network is a sigmoid
neuron. If the input is a vector $x^T = (x_1, \dots, x_n)$, then the discriminator
produces the probability $D(x; w, b) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$, where $w$
and $b$ denote the weights and the bias, respectively. In the following we shall
evaluate the payoff function (19.4.4). Using the properties of logarithmic and
softplus functions, in particular $sp(-u) = sp(u) - u$, we have

$$\ln D(x) = -\ln\big(1 + e^{-(w^T x + b)}\big) = -sp\big(-(w^T x + b)\big) = (w^T x + b) - sp(w^T x + b),$$

and then

$$\mathbb{E}_{x \sim p_{data}} [\ln D(x)] = w^T \mu_x + b - \mathbb{E}_{x \sim p_{data}} [sp(w^T x + b)],$$

where $\mu_x = \mathbb{E}_{x \sim p_{data}}[x]$ denotes the mean of the data. We also have

$$1 - D(x) = 1 - \sigma(w^T x + b) = \sigma(-w^T x - b) = \frac{1}{1 + e^{w^T x + b}},$$

so that

$$\ln(1 - D(x)) = -\ln\big(1 + e^{w^T x + b}\big) = -sp(w^T x + b).$$

Taking expectation, we obtain

$$\mathbb{E}_{x \sim p_{model}} [\ln(1 - D(x))] = -\mathbb{E}_{x \sim p_{model}} [sp(w^T x + b)].$$

Therefore, the payoff function becomes

$$
\begin{aligned}
V(G, D) &= w^T \mu_x + b - \mathbb{E}_{x \sim p_{data}} [sp(w^T x + b)] - \mathbb{E}_{x \sim p_{model}} [sp(w^T x + b)] \\
&= w^T \mu_x + b - \int \big(p_{data}(x) + p_{model}(x)\big)\, sp(w^T x + b)\, dx \\
&= w^T \mu_x + b - 2\, \mathbb{E}_{x \sim p_m} [sp(w^T x + b)],
\end{aligned}
$$

where $p_m(x) = \frac{1}{2}\big(p_{data}(x) + p_{model}(x)\big)$.
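The softplus form of the payoff can be checked numerically. The sketch below relies on the identities $\ln \sigma(u) = u - sp(u)$ and $\ln(1 - \sigma(u)) = -sp(u)$ with $u = w^T x + b$; the weights, the bias, and the two Gaussian sample sets standing in for $p_{data}$ and $p_{model}$ are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softplus(u):
    return np.log1p(np.exp(u))

rng = np.random.default_rng(0)
w, b = np.array([0.7, -1.2]), 0.3                  # hypothetical weights and bias
x_data = rng.normal(1.0, 1.0, size=(1000, 2))      # stand-in samples from p_data
x_model = rng.normal(-1.0, 1.0, size=(1000, 2))    # stand-in samples from p_model

u_data, u_model = x_data @ w + b, x_model @ w + b

# Payoff (19.4.4) estimated directly from D = sigmoid(u)
V_direct = np.mean(np.log(sigmoid(u_data))) + np.mean(np.log(1 - sigmoid(u_model)))

# The same estimate written with the softplus identities
V_softplus = np.mean(u_data) - np.mean(softplus(u_data)) - np.mean(softplus(u_model))
```

Both estimates use the same samples, so they should coincide up to floating-point error.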









The generator's payoff is $-V(G, D)$ (and hence the name of a zero-sum
game), namely the generator intends to maximize $-V(G, D)$, or equivalently,
to minimize $V(G, D)$.

Since each network attempts to maximize its own payoff, this can be set
now as the following minimax problem:

$$G^* = \arg\min_{G} \max_{D} V(G, D),$$

where $G^*$ denotes the optimum generator.

For the time being we shall consider the generator fixed and work on the
inner maximum loop. For a given generator, $G$, the optimum discriminator
function, $D_G^*$, is given by

$$D_G^* = \arg\max_{D} V(G, D). \tag{19.4.5}$$

The next result provides the value of this discriminator.


Proposition 19.4.3 For a fixed generator, $G$, the optimal discriminator
(19.4.5) is given by

$$D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}. \tag{19.4.6}$$

Proof: Considering distributions $p_{data}$ and $p_{model}$ defined on the space $\mathcal{X}$,
the payoff function can be written as the following integral:

$$V(G, D) = \int_{\mathcal{X}} \Big[ p_{data}(x) \ln D(x) + p_{model}(x) \ln(1 - D(x)) \Big]\, dx.$$

Since $G$ is fixed, we consider the payoff as a functional of $D(x)$ as

$$F(D) = \int_{\mathcal{X}} L(D(x))\, dx,$$

with the Lagrangian function

$$L(D) = p_{data}(x) \ln D(x) + p_{model}(x) \ln(1 - D(x)).$$

Critical points satisfy the variational equation

$$\frac{\partial L(D)}{\partial D} = \frac{p_{data}(x)}{D(x)} - \frac{p_{model}(x)}{1 - D(x)} = 0.$$

This becomes

$$p_{data}(x)\big(1 - D(x)\big) = p_{model}(x)\, D(x).$$

Solving for $D$ we obtain the solution

$$D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}.$$

In order to show that this critical point corresponds to a maximum, we
consider the second variation and show that it has a negative value

$$\frac{\partial^2 L(D)}{\partial D^2} = -\frac{p_{data}(x)}{D(x)^2} - \frac{p_{model}(x)}{(1 - D(x))^2} < 0.$$

Hence, $D_G^*(x)$ corresponds to a maximum of $V(G, D)$ for a given $G$. ■


Remark 19.4.4 (i) We note that the optimum $D_G^*(x)$ depends on $G$ through
the density $p_{model}(x)$, which is the density of $G(z)$, with $z \sim p_{code}$.

(ii) This is a theoretical result of existence and uniqueness of the optimum.
However, the result is not practical if $p_{data}$ is not given.
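The pointwise maximization in the proof of Proposition 19.4.3 can be checked with a brute-force search. The sketch below fixes a point $x$, assumes hypothetical density values $p_{data}(x) = 0.8$ and $p_{model}(x) = 0.3$, and maximizes the integrand $L$ over a grid of candidate values $D(x) \in (0, 1)$.

```python
import numpy as np

def L(D, p_data, p_model):
    """Integrand of the payoff at a fixed point x, as a function of the value D = D(x)."""
    return p_data * np.log(D) + p_model * np.log(1.0 - D)

p_data, p_model = 0.8, 0.3                   # hypothetical density values at a fixed x
D_grid = np.linspace(0.001, 0.999, 9981)     # candidate discriminator outputs, step 1e-4
D_best = D_grid[np.argmax(L(D_grid, p_data, p_model))]
D_star = p_data / (p_data + p_model)         # closed form (19.4.6)
```

The grid maximizer `D_best` should agree with the closed-form value `D_star` up to the grid resolution.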


The generator optimum, $G^*$, is now given by

$$G^* = \arg\min_{G} V(G, D_G^*).$$

In order to find it, we first need to evaluate the maximum value of the payoff
$V(G, D)$ at its optimum point.

Proposition 19.4.5 If $D_{JS}$ denotes the Jensen-Shannon divergence given by
(3.7.4), then the maximum value of the payoff function is

$$V(G, D_G^*) = 2 D_{JS}\big(p_{model}(x) \,\|\, p_{data}(x)\big) - \ln 4. \tag{19.4.7}$$

Substituting the value given by formula (19.4.6) into the payoff we obtain

$$
\begin{aligned}
V(G, D_G^*) &= \int_{\mathcal{X}} \Big[ p_{data}(x) \ln D_G^*(x) + p_{model}(x) \ln\big(1 - D_G^*(x)\big) \Big]\, dx \\
&= \int_{\mathcal{X}} \Big[ p_{data}(x) \ln \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} + p_{model}(x) \ln \frac{p_{model}(x)}{p_{data}(x) + p_{model}(x)} \Big]\, dx.
\end{aligned}
$$

In the following, we shall add and then subtract $\ln 2\, p_{data}(x)$ and $\ln 2\, p_{model}(x)$
and use some algebraic manipulations to split it into three integrals

$$
\begin{aligned}
V(G, D_G^*) &= \int_{\mathcal{X}} \Big[ (\ln 2 - \ln 2)\, p_{data}(x) + p_{data}(x) \ln \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} \\
&\qquad\; + (\ln 2 - \ln 2)\, p_{model}(x) + p_{model}(x) \ln \frac{p_{model}(x)}{p_{data}(x) + p_{model}(x)} \Big]\, dx \\
&= I_1 + I_2 + I_3,
\end{aligned}
$$


















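where

$$
\begin{aligned}
I_1 &= -\int_{\mathcal{X}} \ln 2\, \big(p_{data}(x) + p_{model}(x)\big)\, dx \\
I_2 &= \int_{\mathcal{X}} p_{data}(x) \Big[ \ln 2 + \ln \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} \Big]\, dx \\
I_3 &= \int_{\mathcal{X}} p_{model}(x) \Big[ \ln 2 + \ln \frac{p_{model}(x)}{p_{data}(x) + p_{model}(x)} \Big]\, dx.
\end{aligned}
$$

In the following we evaluate each integral. Using that $p_{data}$ and $p_{model}$ are
probability density functions, each integrating to 1, we have

$$I_1 = -\ln 2 \Big[ \int_{\mathcal{X}} p_{data}(x)\, dx + \int_{\mathcal{X}} p_{model}(x)\, dx \Big] = -2 \ln 2 = -\ln 4.$$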


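Using properties of the logarithmic function, we evaluate the second integral as
a Kullback-Leibler divergence

$$
\begin{aligned}
I_2 &= \int_{\mathcal{X}} p_{data}(x) \ln \frac{2\, p_{data}(x)}{p_{data}(x) + p_{model}(x)}\, dx
= \int_{\mathcal{X}} p_{data}(x) \ln \frac{p_{data}(x)}{\frac{1}{2}\big(p_{data}(x) + p_{model}(x)\big)}\, dx \\
&= D_{KL}\Big( p_{data}(x) \,\Big\|\, \frac{p_{data}(x) + p_{model}(x)}{2} \Big).
\end{aligned}
$$

The third integral can be evaluated in a similar way

$$
\begin{aligned}
I_3 &= \int_{\mathcal{X}} p_{model}(x) \ln \frac{2\, p_{model}(x)}{p_{data}(x) + p_{model}(x)}\, dx
= \int_{\mathcal{X}} p_{model}(x) \ln \frac{p_{model}(x)}{\frac{1}{2}\big(p_{data}(x) + p_{model}(x)\big)}\, dx \\
&= D_{KL}\Big( p_{model}(x) \,\Big\|\, \frac{p_{data}(x) + p_{model}(x)}{2} \Big).
\end{aligned}
$$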


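Using the definition of the Jensen-Shannon divergence (3.7.4), we obtain

$$
\begin{aligned}
V(G, D_G^*) &= I_1 + I_2 + I_3 \\
&= -\ln 4 + D_{KL}\Big( p_{data}(x) \,\Big\|\, \frac{p_{data}(x) + p_{model}(x)}{2} \Big) + D_{KL}\Big( p_{model}(x) \,\Big\|\, \frac{p_{data}(x) + p_{model}(x)}{2} \Big) \\
&= -\ln 4 + 2 D_{JS}\big(p_{model}(x) \,\|\, p_{data}(x)\big).
\end{aligned}
$$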
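Formula (19.4.7) can be verified numerically in a discrete setting, where integrals become sums. The sketch below assumes two hypothetical distributions on four points and takes the Jensen-Shannon divergence to be the average of the two Kullback-Leibler divergences to the mixture $\frac{1}{2}(p + q)$, which is the form that the computation above leads to.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.1, 0.4, 0.3, 0.2])        # hypothetical data distribution
p_model = np.array([0.25, 0.25, 0.25, 0.25])   # hypothetical model distribution

D_star = p_data / (p_data + p_model)                         # optimal discriminator (19.4.6)
V_star = np.sum(p_data * np.log(D_star) + p_model * np.log(1 - D_star))
V_formula = 2 * js(p_model, p_data) - np.log(4)              # right-hand side of (19.4.7)
```

The two values `V_star` and `V_formula` should coincide, and both reduce to $-\ln 4$ when the two distributions are equal.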
























The next result shows an expected fact about GANs, namely, that at
equilibrium the model distribution coincides with the data distribution
(Goodfellow et al. [47]).

Proposition 19.4.6 The global minimum

$$G^* = \arg\min_{G} V(G, D_G^*)$$

is achieved if and only if $p_{model} = p_{data}$. At this minimum point the value is
$-\ln 4$.

Proof: Using relation (19.4.7) we have

$$
\begin{aligned}
G^* &= \arg\min_{G} V(G, D_G^*) = \arg\min_{G} \Big( 2 D_{JS}\big(p_{model}(x) \,\|\, p_{data}(x)\big) - \ln 4 \Big) \\
&= \arg\min_{G}\, 2 D_{JS}\big(p_{model}(x) \,\|\, p_{data}(x)\big).
\end{aligned}
$$

By Proposition 3.7.1 the right-side expression reaches the global minimum
value of zero if and only if $p_{model}(x) = p_{data}(x)$. In this case we obviously
have $\min_G V(G, D_G^*) = V(G^*, D_{G^*}^*) = -\ln 4$. ■

The discriminator output at equilibrium is obtained by substituting
$p_{model} = p_{data}$ into (19.4.6)

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} = \frac{1}{2}.$$

The fact that the discriminator's output values are 0.5 means that it cannot
distinguish whether the data $x$ are fake or true. At the equilibrium
the value of the payoff function is

$$
\begin{aligned}
V(G^*, D^*) &= \int_{\mathcal{X}} \Big[ p_{data}(x) \ln D^*(x) + p_{model}(x) \ln\big(1 - D^*(x)\big) \Big]\, dx \\
&= \int_{\mathcal{X}} \Big[ p_{data}(x) \ln \frac{1}{2} + p_{model}(x) \ln \frac{1}{2} \Big]\, dx \\
&= \ln \frac{1}{2} \int_{\mathcal{X}} \big( p_{data}(x) + p_{model}(x) \big)\, dx \\
&= 2 \ln \frac{1}{2} = -\ln 4,
\end{aligned}
$$

which recovers the result of the previous proposition.

In general, the dimension of the latent space $\mathcal{Z}$ is smaller than the
dimension of the data space $\mathcal{X}$. In the case when they are equal and the generator
transformation $G : \mathcal{Z} \to \mathcal{X}$ is differentiable and invertible, the model and
code distributions are related by

$$p_{model}(x) = p_{code}\big(G^{-1}(x)\big)\, \big|\det J_{G^{-1}}(x)\big|.$$

Moreover, the second term of the payoff $V(G, D)$ can be written as an
expectation over the coding density as

$$\mathbb{E}_{x \sim p_{model}} [\ln(1 - D(x))] = \mathbb{E}_{z \sim p_{code}} [\ln(1 - D(G(z)))].$$

The computation involves the change of variables $z = G^{-1}(x)$ as follows:

$$
\begin{aligned}
\mathbb{E}_{x \sim p_{model}} [\ln(1 - D(x))] &= \int_{\mathcal{X}} \ln(1 - D(x))\, p_{model}(x)\, dx \\
&= \int_{\mathcal{X}} \ln(1 - D(x))\, p_{code}\big(G^{-1}(x)\big)\, \big|\det J_{G^{-1}}(x)\big|\, dx \\
&= \int_{\mathcal{Z}} \ln\big(1 - D(G(z))\big)\, p_{code}(z)\, dz \\
&= \mathbb{E}_{z \sim p_{code}} [\ln(1 - D(G(z)))].
\end{aligned}
$$


Remark 19.4.7 The minimax problem $\min_G \max_D V(G, D)$ is not the same as
$\max_D \min_G V(G, D)$. The former is the correct version, while the latter does
not work well since the generator might place all mass on the most likely points,
fooling the discriminator. This leads to a mode collapse, which is a setting
where the GAN keeps producing the same output.


Training procedure A GAN is trained by an application of a simultaneous
stochastic gradient descent method. We repeatedly sample two minibatches
of data, one from the training set and another from generated samples. Then
we run the gradient descent on both players simultaneously. This will
lead to one update for each player, in the direction of the gradient for the
discriminator, which maximizes $V$, and in the opposite direction for the
generator, which minimizes $V$:

$$\theta^{(g)}_{n+1} = \theta^{(g)}_n - \eta\, \nabla_{\theta^{(g)}} V\big(\theta^{(g)}_n\big)$$

$$\theta^{(d)}_{n+1} = \theta^{(d)}_n + \eta\, \nabla_{\theta^{(d)}} V\big(\theta^{(d)}_n\big),$$

where $\eta > 0$ is the learning rate and $V$ is the payoff defined by (19.4.4). The
gradients can be computed as follows. Using that $D(x) = D(x, \theta^{(d)})$
and that $p_{model}$ and $p_{data}$ are independent of $\theta^{(d)}$, we have


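$$
\begin{aligned}
\nabla_{\theta^{(d)}} V &= \nabla_{\theta^{(d)}} \mathbb{E}_{x \sim p_{data}} \big[\ln D(x, \theta^{(d)})\big] + \nabla_{\theta^{(d)}} \mathbb{E}_{x \sim p_{model}} \big[\ln\big(1 - D(x, \theta^{(d)})\big)\big] \\
&= \mathbb{E}_{x \sim p_{data}} \Big[ \frac{1}{D(x)} \frac{\partial D(x)}{\partial \theta^{(d)}} \Big] - \mathbb{E}_{x \sim p_{model}} \Big[ \frac{1}{1 - D(x)} \frac{\partial D(x)}{\partial \theta^{(d)}} \Big].
\end{aligned}
$$

If $D(x) = \sigma\big(a(x, \theta^{(d)})\big)$, where $\sigma$ is the logistic sigmoid, then using

$$\frac{\partial D(x)}{\partial \theta^{(d)}} = \sigma'\big(a(x, \theta^{(d)})\big) \frac{\partial a(x)}{\partial \theta^{(d)}} = D(x)\big(1 - D(x)\big) \frac{\partial a(x)}{\partial \theta^{(d)}},$$

the gradient computation can be continued as

$$\nabla_{\theta^{(d)}} V = \frac{\partial}{\partial \theta^{(d)}} \mathbb{E}_{x \sim p_{data}} [a(x)] - 2\, \mathbb{E}_{x \sim \frac{1}{2}(p_{data} + p_{model})} \Big[ D(x) \frac{\partial a(x)}{\partial \theta^{(d)}} \Big].$$

It is worth noting that the first expectation at the equilibrium point is equal
to the Kullback-Leibler divergence $D_{KL}(p_{data} \,\|\, p_{model})$, see Exercise 19.7.5.
We have assumed the probability densities $p_{data}$ and $p_{model}$ decrease to zero
at infinity fast enough such that the derivation operator commutes with the
expectation operators.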

The gradient $\nabla_{\theta^{(g)}} V$ is computed similarly, see Exercise 19.7.7.
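To make the training procedure concrete, here is a minimal one-dimensional sketch, not the procedure of any particular library. It assumes a sigmoid-neuron discriminator $D(x) = \sigma(wx + b)$, a hypothetical shift generator $G(z; \theta^{(g)}) = z + \theta^{(g)}$ with $z \sim \mathcal{N}(0, 1)$, training data drawn from $\mathcal{N}(3, 1)$, and arbitrary values for the learning rate and minibatch size. The discriminator parameters follow the gradient of $V$ and the generator parameter moves against it.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def payoff(w, b, theta_g, x_real, z):
    """Minibatch estimate of V in (19.4.4) with D(x) = sigmoid(w x + b), G(z) = z + theta_g."""
    x_fake = z + theta_g
    return (np.mean(np.log(sigmoid(w * x_real + b)))
            + np.mean(np.log(1 - sigmoid(w * x_fake + b))))

def gradients(w, b, theta_g, x_real, z):
    """Analytic gradients of the minibatch payoff with respect to w, b and theta_g."""
    x_fake = z + theta_g
    d_real, d_fake = sigmoid(w * x_real + b), sigmoid(w * x_fake + b)
    grad_w = np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake)
    grad_b = np.mean(1 - d_real) - np.mean(d_fake)
    grad_theta_g = -w * np.mean(d_fake)
    return grad_w, grad_b, grad_theta_g

rng = np.random.default_rng(0)
w, b, theta_g, eta = 0.1, 0.0, 0.0, 0.05
history = []
for _ in range(2000):
    x_real = rng.normal(3.0, 1.0, size=64)     # minibatch from the training distribution
    z = rng.standard_normal(64)                # minibatch of latent codes
    gw, gb, gg = gradients(w, b, theta_g, x_real, z)
    w, b = w + eta * gw, b + eta * gb          # discriminator: gradient ascent on V
    theta_g = theta_g - eta * gg               # generator: gradient descent on V
    history.append(theta_g)
```

The list `history` records the generator parameter along the run. No convergence claim is made for this toy setup; as Example 19.3.1 suggests, simultaneous updates may oscillate rather than settle.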


Even if GANs are considered the most successful generative models, there
are still some unsolved problems, which are active areas of research
nowadays. These include vanishing gradients, mode collapse, and difficulties
with counting, perspective, and global structure.


19.5 Generative Moment Matching Networks 

Generative moment matching networks were introduced by Li et al. [76]
in 2015. In this case the generator is trained by moment matching, a technique
by which the generator's outcome has moments as close as possible to the
corresponding moments of the training data.

This idea is based on the result stating that, under suitable conditions (for
instance, when the moment generating functions exist near zero), two random
variables with equal moments of all orders have the same distribution. If the
random variables are of a particular type, fewer moments are needed. In
particular, two Gaussian random variables with the same mean and variance have
the same distribution.

This generative model is a generator network, such as a convolutional
network, which is provided a sample $z$ from a uniform (or Gaussian)
distribution and produces an output $x$, such as an image of a face or digit, for
example. If the network parameters are denoted by $\theta$, the outcome variable
$X$ can be written in terms of the input variable $Z$ as $X = G(Z; \theta)$, with
$Z \sim \mathrm{Unif}[0,1]$, for example. The density of $X$ will be denoted by $p_\theta$.

We assume the corresponding random variable describing the training
data (such as a set of human faces or MNIST data digits) is denoted by
$Y$ and has density $q$. The network parameter $\theta$ has to be tuned such that
the distributions of $X$ and $Y$ are as close as possible, and this is
done by matching the first $k$ moments. For this, we choose the function
$\phi(x) = (x, x^2, \dots, x^k)^T$ and consider the parameter $\theta$ that minimizes the
maximum mean discrepancy defined by (3.8.5)

$$\theta^* = \arg\min_{\theta}\, d_{MMD}(p_\theta, q) = \arg\min_{\theta} \big\| \mathbb{E}[\phi(G(Z; \theta))] - \mathbb{E}[\phi(Y)] \big\|_{Eu}.$$

In practice we pick $n$ random inputs $\{z_1, \dots, z_n\}$ that yield an output
sample $\{x_1, \dots, x_n\}$ and choose another random sample $\{y_1, \dots, y_m\}$ from the
training data. In order to match the moments of the distributions underlying
these two samples we need to consider the parameter

$$\theta^* = \arg\min_{\theta} \Big\{ \frac{1}{n^2} \sum_{i,j} K(x_i, x_j) + \frac{1}{m^2} \sum_{i,j} K(y_i, y_j) - \frac{2}{mn} \sum_{i,j} K(x_i, y_j) \Big\},$$

where $K(x, y) = \phi(x)^T \phi(y)$ and we used formula (3.8.6). The generator
network is trained using the backpropagation method.
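The kernel expression above is the squared Euclidean distance between the empirical moment vectors, which the following sketch checks on one-dimensional samples. The two uniform samples are stand-ins for the generator outputs and the training data, and the function names and the choice $k = 3$ are assumptions for illustration.

```python
import numpy as np

def phi(x, k=3):
    """Feature map phi(x) = (x, x^2, ..., x^k) for a 1-D sample array; shape (len(x), k)."""
    return np.stack([x ** j for j in range(1, k + 1)], axis=1)

def mmd2_kernel(x, y, k=3):
    """Moment-matching objective written with the kernel K(x, y) = phi(x)^T phi(y)."""
    fx, fy = phi(x, k), phi(y, k)
    n, m = len(x), len(y)
    return ((fx @ fx.T).sum() / n**2 + (fy @ fy.T).sum() / m**2
            - 2.0 * (fx @ fy.T).sum() / (m * n))

def mmd2_moments(x, y, k=3):
    """Squared Euclidean distance between the first k empirical moments."""
    diff = phi(x, k).mean(axis=0) - phi(y, k).mean(axis=0)
    return diff @ diff

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)    # stand-in for generator outputs x_i = G(z_i; theta)
y = rng.uniform(0.0, 1.0, size=300)    # stand-in for training samples y_j
```

Both functions return the same value for the same pair of samples, and the value is zero when a sample is compared with itself. In a generative moment matching network this quantity would be minimized with respect to $\theta$ by backpropagating through the generator.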


19.6 Summary 

We have shown how generative networks can be used either to produce
samples from a given probability distribution, or to generate examples that
resemble data in the training set. The idea of a generative model is to provide a
random seed to the model and to obtain an outcome that resembles the
training data or a sample from the data distribution.

A central role was dedicated to GANs, a generative model whose
architecture consists of two networks, a generator and a discriminator network, which
play a competitive game. This noncooperative game is usually described as
the relation between a painting counterfeiter (the generator) and an expert
investigator (the discriminator). The counterfeiter tries to fool the expert,
while the expert intends to identify the forgery as fake. The expert trains on
fake and real paintings. In the long run they force each other to improve: the
counterfeiter does such a good job that the expert gets confused about
whether a painting is fake or genuine. In the most common setup the game
is a zero-sum game, or a minimax problem. The equilibrium point is a saddle
point of the payoff.

Another model discussed is the generative moment matching network.
This consists of a generator network whose parameters are tuned such that
the outcome moments match the corresponding moments of the training
distribution.


19.7 Exercises 

Exercise 19.7.1 We apply the simultaneous gradient descent method for
the minimax problem with payoff $V(x, y) = xy$ in the discrete case when the
learning rate is $\eta > 0$.




(i) Write the system update.

(ii) Write the system state $(x_n, y_n)$ in terms of $(x_0, y_0)$.

(iii) Is the sequence $(x_n, y_n)$ convergent?

Exercise 19.7.2 Consider a competitive game between two players, who
fight over the linear payoff function $V(x,y) = ax + by$, with $a^2 + b^2 \neq 0$.
The first player controls variable $x$ and wants to minimize $V(x, y)$, while the
second player controls variable $y$ and wants to maximize the payoff function.

(i) Write the differential system of the learning dynamics given by the
simultaneous gradient descent with infinitesimal learning rate and solve for the
learning trajectory.

(ii) Show that there is no equilibrium solution.

Exercise 19.7.3 Consider a competitive game between two players with the
payoff function $V(x,y) = \frac{1}{2}(x^2 - y^2)$. The player who controls the $x$ variable
intends to minimize the payoff, while the player who controls the $y$ variable
wants to maximize it. Show that this game has a unique equilibrium solution.
Consider both continuous and discrete learning.


Exercise 19.7.4 We recall that the expansion or contraction of a vector field
$U = (U_1, U_2)$ in the plane is given by its divergence $\text{div}\, U = \partial_x U_1 + \partial_y U_2$.
The velocity vector field along the learning flow $(x(t), y(t))$ is defined by
$U(x,y) = (\dot{x}(t), \dot{y}(t))$.

(i) Show that the velocity vector of the learning flow associated with the
payoff function $V(x,y) = x^2 + y^2$ has a zero divergence.

(ii) Prove that the velocity vector of the learning flow has a zero divergence
if and only if the associated payoff function is of the form
$V(x,y) = F(x + y) + G(x - y)$, with $F$ and $G$ two arbitrary twice differentiable functions.

(iii) Show that if the divergence of the learning flow is zero, then there are
no equilibrium points along that flow.


Exercise 19.7.5 (i) Consider a GAN with a fixed generator, $G$. Show that
the optimum discriminator satisfies $D_G^*(x) = \sigma(a(x))$, where

$$e^{a(x)} = \frac{p_{data}(x)}{p_{model}(x)}$$

and $\sigma$ is the logistic function.

(ii) With the $a(x)$ given by (i), show that $\mathbb{E}_{x \sim p_{data}}[a(x)] = D_{KL}(p_{data} \| p_{model})$.





Exercise 19.7.6 Consider a GAN with the model distribution $p_{model}(x; \theta^{(g)})$.
We assume the parameter vector $\theta^{(g)}$ is obtained by the maximum likelihood
estimation, namely

$$\theta^{(g)} = \theta^{MLE} = \arg\max_{\theta} \mathbb{E}_{x \sim p_{data}} [\ln p_{model}(x;\theta)].$$

Find a function $f(x)$ such that $\theta^* = \theta^{MLE}$, where

$$\theta^* = \arg\max_{\theta} \mathbb{E}_{x \sim p_{model}} [f(x)].$$


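Exercise 19.7.7 Consider the GAN payoff given by (19.4.4).

(i) Show that

$$\nabla_{\theta^{(g)}} V = \int_{\mathcal{X}} \ln(1 - D(x)) \, \nabla_{\theta^{(g)}} p_{model}(x; \theta^{(g)}) \, dx.$$

(ii) Assume the generator function $G : \mathcal{Z} \to \mathcal{X}$, $x = G(z, \theta^{(g)})$, is invertible
and differentiable. Show that

$$\nabla_{\theta^{(g)}} V = -\mathbb{E}_{z \sim p_{code}} \Big[ \frac{1}{1 - D(G(z))} \frac{\partial D(x)}{\partial x}\Big|_{x = G(z)} \frac{\partial G(z)}{\partial \theta^{(g)}} \Big].$$

(iii) Assume $D(x) = \sigma(a(x))$, where $\sigma$ is the logistic function. Show that the
gradient formula in part (ii) becomes

$$\nabla_{\theta^{(g)}} V = -\mathbb{E}_{z \sim p_{code}} \Big[ D(G(z)) \frac{\partial a(x)}{\partial x}\Big|_{x = G(z)} \frac{\partial G(z)}{\partial \theta^{(g)}} \Big].$$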

Exercise 19.7.8 Consider a zero-sum game where one player controls the
variable $\hat{y}$ and wants to minimize the cross-entropy payoff

$$V(\hat{y}, y) = y \ln \hat{y} + (1 - y) \ln(1 - \hat{y}),$$

while the other player controls the variable $y$ and intends to minimize the
payoff $-V(\hat{y}, y)$. Both variables take values in $(0,1)$. Find the equilibrium
point of the game.


Exercise 19.7.9 In order to avoid the vanishing gradient problem the
following choice of the generator payoff has been used:

$$J^{(g)} = \mathbb{E}_{z \sim p_{code}} [\ln D(G(z, \theta^{(g)}))].$$

In this case the generator maximizes the log probability of the discriminator
being mistaken. Find the gradient $\nabla_{\theta^{(g)}} J^{(g)}$. Assume the discriminator is
optimal.















Chapter 20 

Stochastic Networks 


This chapter deals with stochastic networks, such as Hopfield networks and
Boltzmann machines. A Hopfield network is an ensemble of perceptrons which
interact with each other until a certain cost function, which is defined in
terms of the interaction between the neurons (called the energy function), is
minimized. They are useful in solving combinatorial optimization problems. A
Boltzmann machine is like a Hopfield network in which the perceptrons have
been replaced by binary stochastic neurons. This way, a Boltzmann machine
can be seen as the noisy version of a Hopfield network. The noise is used to
keep Hopfield networks from getting stuck in local minima of the energy function.
We also present the equilibrium distribution of a Boltzmann machine, its
entropy, and its associated Fisher information metric.


20.1 Stochastic Neurons 


We have seen that a deterministic perceptron is a computing unit, which
accepts $n$ inputs, $x_1, \ldots, x_n$, and provides the one-dimensional output

$$y = H(w^T x + b) = H\Big( \sum_{i=1}^{n} w_i x_i + b \Big) = \begin{cases} 1, & \text{if } w^T x + b > 0 \\ 0, & \text{otherwise.} \end{cases}$$

A binary stochastic neuron is a unit which accepts $n$ inputs, $x_1, \ldots, x_n$, and
provides the output $Y$, given by the binary random variable $Y \in \{0,1\}$,
having the distribution

$$P(Y = 1|x) = \sigma(w^T x + b),$$

$$P(Y = 0|x) = 1 - \sigma(w^T x + b) = \sigma(-w^T x - b),$$


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3_20 




where $\sigma$ is the logistic function. The conditioning on $x$ can be omitted if the
input is deterministic.

When the input satisfies $w^T x + b > 0$, the deterministic perceptron
provides the value 1 (with probability 1), while the stochastic neuron provides
the value 1 with probability $\sigma(w^T x + b)$.

Fisher metric The information of a stochastic neuron about its weights
is assessed by its Fisher matrix. This will be explicitly computed in the
following. Denoting the log-likelihood by $\ell$, for any $1 \le i, j \le n$ we have

$$g_{ij} = \mathbb{E}[\partial_{w_i} \ell \, \partial_{w_j} \ell]$$

$$= P(Y = 1|x) \, \partial_{w_i} \ln P(Y = 1|x) \, \partial_{w_j} \ln P(Y = 1|x)$$

$$+ P(Y = 0|x) \, \partial_{w_i} \ln P(Y = 0|x) \, \partial_{w_j} \ln P(Y = 0|x)$$

$$= \sigma(w^T x + b) \, \partial_{w_i} \ln \sigma(w^T x + b) \, \partial_{w_j} \ln \sigma(w^T x + b)$$

$$+ (1 - \sigma(w^T x + b)) \, \partial_{w_i} \ln(1 - \sigma(w^T x + b)) \, \partial_{w_j} \ln(1 - \sigma(w^T x + b)).$$

Using the differentiation rule $(\ln f(x))' = f'(x)/f(x)$ and the sigmoid
property $\sigma' = \sigma(1 - \sigma)$, we obtain

$$g_{ij} = \sigma(w^T x + b)(1 - \sigma(w^T x + b))^2 x_i x_j + \sigma(w^T x + b)^2 (1 - \sigma(w^T x + b)) x_i x_j$$

$$= \sigma(w^T x + b)(1 - \sigma(w^T x + b)) x_i x_j.$$

The formula can be extended to include the bias if we write $\tilde{x} = (1, x)^T$ and $\tilde{w} = (b, w)$, as

$$\tilde{g}_{ij} = \sigma(w^T x + b)(1 - \sigma(w^T x + b)) \tilde{x}_i \tilde{x}_j,$$

or, in matrix form, as $\tilde{g} = \sigma(w^T x + b)(1 - \sigma(w^T x + b)) \tilde{x} \tilde{x}^T$. Since any
two columns of the matrix $\tilde{x} \tilde{x}^T$ are proportional, we have $\det \tilde{g} = 0$. Also,
$\text{Trace}\, \tilde{g} = (1 + \|x\|^2) \sigma(w^T x + b)(1 - \sigma(w^T x + b)) > 0$.

The information density in direction $\tilde{w} = (b, w)$ is given by

$$\tilde{w}^T \tilde{g} \tilde{w} = \sigma(w^T x + b)(1 - \sigma(w^T x + b)) \, \tilde{w}^T \tilde{x} \tilde{x}^T \tilde{w}$$

$$= \sigma(w^T x + b)(1 - \sigma(w^T x + b)) (\tilde{x}^T \tilde{w})^2$$

$$= (w^T x + b)^2 \sigma(w^T x + b)(1 - \sigma(w^T x + b)).$$

This implies a zero density along the hyperplane $\{x \in \mathbb{R}^n;\ w^T x + b = 0\}$.
Also, the information density tends to zero when the sigmoid saturates, i.e.,
for $w^T x + b \to \pm\infty$.
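The closed forms for the trace and the information density can be verified on a small example. This is a sketch only; the helper names `fisher_matrix` and `information_density` and the numerical values are chosen for illustration.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def fisher_matrix(w, b, x):
    """Extended Fisher matrix s(1 - s) * xt xt^T, with xt = (1, x)."""
    s = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    xt = [1.0] + list(x)
    return [[s * (1.0 - s) * xi * xj for xj in xt] for xi in xt]

def information_density(w, b, x):
    """Quadratic form wt^T g wt in the direction wt = (b, w)."""
    g = fisher_matrix(w, b, x)
    wt = [b] + list(w)
    size = len(wt)
    return sum(wt[i] * g[i][j] * wt[j] for i in range(size) for j in range(size))

w, b, x = [0.5, -1.0], 0.2, [1.0, 2.0]
g = fisher_matrix(w, b, x)
u = sum(wi * xi for wi, xi in zip(w, x)) + b
s = sigmoid(u)
trace = sum(g[i][i] for i in range(len(g)))
print(trace, (1.0 + sum(xi ** 2 for xi in x)) * s * (1.0 - s))
print(information_density(w, b, x), u ** 2 * s * (1.0 - s))
```

For an input on the hyperplane $w^T x + b = 0$, for instance `w = [1.0, 1.0]`, `b = -2.0`, `x = [1.0, 1.0]`, the density evaluates to zero, in agreement with the statement above.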

Maximum likelihood A stochastic neuron learns by maximizing the
likelihood. For this, we consider the random variable $U = 2Y - 1$. Then $U = -1$
for $Y = 0$ and $U = 1$ for $Y = 1$. Moreover,

$$P(U = 1) = P(Y = 1) = \sigma(w^T x + b) = \sigma((w^T x + b)U)$$

$$P(U = -1) = P(Y = 0) = 1 - \sigma(w^T x + b) = \sigma(-w^T x - b) = \sigma((w^T x + b)U).$$





The advantage of using the new variable $U$ is that the probability has the
same expression in both cases. We transform the training set
$\{(x_1, z_1), \ldots, (x_n, z_n)\}$ into $\{(x_1, t_1), \ldots, (x_n, t_n)\}$, where $t_j = 2z_j - 1$.
The optimal parameters are given by

$$(w^*, b^*) = \arg\max_{w,b} \prod_{j=1}^{n} \sigma((w^T x_j + b)t_j) = \arg\max_{w,b} \ln \prod_{j=1}^{n} \sigma((w^T x_j + b)t_j)$$

$$= \arg\max_{w,b} \sum_{j=1}^{n} \ln \sigma((w^T x_j + b)t_j).$$

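The cost function

$$C(w, b) = \frac{1}{n} \sum_{j=1}^{n} \ln \sigma((w^T x_j + b)t_j)$$

can be maximized by the gradient ascent method. This means taking a step
of size $\eta$ in the gradient direction at each parameter update

$$w^{(m+1)} = w^{(m)} + \eta \nabla_w C(w, b)$$

$$b^{(m+1)} = b^{(m)} + \eta \, \partial_b C(w, b),$$

with the gradient computed using the chain rule and properties of the
sigmoid:

$$\frac{\partial C}{\partial w_k} = \frac{1}{n} \sum_{j=1}^{n} x_j^k t_j \frac{\sigma'((w^T x_j + b)t_j)}{\sigma((w^T x_j + b)t_j)} = \frac{1}{n} \sum_{j=1}^{n} x_j^k t_j \, \sigma(-(w^T x_j + b)t_j)$$

$$\frac{\partial C}{\partial b} = \frac{1}{n} \sum_{j=1}^{n} t_j \, \sigma(-(w^T x_j + b)t_j),$$

where $x_j^k$ denotes the $k$th component of the input $x_j$.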
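The update rule above can be sketched in a few lines. The code below is an illustration, not a reference implementation: the names `cost`, `gradient`, and `fit`, the toy data set, the step size, and the number of steps are all assumptions made for the example.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def activation(w, b, x):
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def cost(w, b, xs, ts):
    """C(w, b) = (1/n) sum_j ln sigma((w^T x_j + b) t_j)."""
    return sum(math.log(sigmoid(activation(w, b, x) * t))
               for x, t in zip(xs, ts)) / len(xs)

def gradient(w, b, xs, ts):
    """Gradient of C, using sigma'/sigma = sigma(-u)."""
    n = len(xs)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, t in zip(xs, ts):
        r = t * sigmoid(-activation(w, b, x) * t)
        grad_w = [g + xk * r / n for g, xk in zip(grad_w, x)]
        grad_b += r / n
    return grad_w, grad_b

def fit(xs, zs, eta=0.5, steps=200):
    """Gradient ascent on C, starting from zero parameters."""
    ts = [2 * z - 1 for z in zs]  # map labels {0, 1} to {-1, 1}
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = gradient(w, b, xs, ts)
        w = [wk + eta * g for wk, g in zip(w, grad_w)]
        b += eta * grad_b
    return w, b

xs = [[0.0], [1.0], [2.0], [3.0], [4.0]]
zs = [0, 1, 0, 1, 1]  # toy labels, not linearly separable
w_fit, b_fit = fit(xs, zs)
print(w_fit, b_fit)
```

The toy labels are deliberately not separable, so the maximizer is finite and the iteration does not drift to infinite weights.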


Simulated annealing method A binary stochastic neuron converges in
a certain sense to a deterministic perceptron. First, we make the following
modification in the outcome distribution of the stochastic neuron

$$P(Y = 1|x) = \sigma_c(w^T x + b),$$

$$P(Y = 0|x) = 1 - \sigma_c(w^T x + b) = \sigma_c(-w^T x - b),$$

where $\sigma_c(x) = \sigma(cx) = \frac{1}{1 + e^{-cx}}$, with $c > 0$. The associated cost function is

$$C(w, b) = \frac{1}{n} \sum_{j=1}^{n} \ln \sigma_c((w^T x_j + b)t_j).$$





For $c \to 0$ we obtain equiprobable states, i.e.,

$$P(Y = 1|x) = \frac{1}{2}, \qquad P(Y = 0|x) = \frac{1}{2},$$

regardless of the input value, $x$. This corresponds to the maximal noisy case.

When $c$ increases unboundedly, the function $\sigma_c(x)$ tends almost
everywhere (except at the origin) to the Heaviside step function $H(x)$. For $c \to \infty$, we
obtain the distribution

$$P(Y = 1|x) = H(w^T x + b)$$

$$P(Y = 0|x) = 1 - H(w^T x + b).$$

This is equivalent to the fact that $Y = 1$ if $w^T x + b > 0$ and $Y = 0$ otherwise,
which corresponds to a deterministic perceptron.

We note that from the annealing point of view, the constant c is regarded 
as the inverse of temperature. Increasing the value of c means to decrease 
the temperature in order to obtain a global minimum of the perceptron cost 
function. 
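The two limiting regimes can be illustrated numerically. In this small sketch, `sigma_c` and `heaviside` are illustrative names and the sample values of $c$ are arbitrary.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def sigma_c(x, c):
    """Scaled sigmoid sigma_c(x) = sigma(c x); c plays the role of 1/T."""
    return sigmoid(c * x)

def heaviside(x):
    return 1.0 if x > 0 else 0.0

for c in (1e-6, 1.0, 1e4):
    # Activation probabilities for a positive and a negative signal.
    print(c, sigma_c(0.3, c), sigma_c(-0.3, c))
```

For very small `c` both probabilities are close to 1/2, while for very large `c` they approach `heaviside(0.3)` and `heaviside(-0.3)`. At the origin the value stays 1/2 for every `c`.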


20.2 Boltzmann Distribution 

The thermodynamic equilibrium of a system of particles at a given
temperature is described by a probability distribution, called the Boltzmann
distribution. Given its importance in the study of stochastic neural networks, we
shall provide a detailed presentation of this distribution in the following.

Consider a thermodynamic system with $N$ states and a random variable
$x$ that describes the state of the system. The possible values of the state
variable, $x$, are $x_1, \ldots, x_N$, which are taken with probabilities $p_j = P(x = x_j)$.
Furthermore, we assume that each state, $x_j$, is associated with an energy
level of the system, $E_j$, which is a positive real number. Equivalently stated,
the system takes the state of energy $E_j$ with probability $p_j$.

The state of the system changes due to particle interactions. By the second
law of Thermodynamics¹ the change of the states should be such that the
total entropy of the system increases. In the absence of any other constraints,
the system tends in the long run to the uniform distribution,
$p_1 = \cdots = p_N = \frac{1}{N}$, which is known to be the distribution with the largest entropy.


1 This states that the entropy of an isolated thermodynamic system increases.









However, the particle interactions are assumed to take place at a given
prescribed temperature, $T$. The temperature is proportional to the average
energy of the system, i.e.,

$$T \sim \sum_{j=1}^{N} p_j E_j.$$

Therefore, it suffices to search for the distribution of maximum entropy, which
is subject to the constraint $\sum_{j=1}^{N} p_j E_j = k$, with $k$ a positive constant. This
corresponds to the state of the system with a maximum uncertainty at a
given temperature.

In order to find the distribution with the largest entropy,
$H(p) = -\sum_{j=1}^{N} p_j \ln p_j$, subject to constraints

$$\sum_{j=1}^{N} p_j = 1 \qquad (20.2.1)$$

$$\sum_{j=1}^{N} p_j E_j = k, \qquad (20.2.2)$$

we construct the following function:

$$F(p_1, \ldots, p_N, \lambda_1, \lambda_2) = -\sum_{j=1}^{N} p_j \ln p_j + \lambda_1 \Big( \sum_{j=1}^{N} p_j E_j - k \Big) + \lambda_2 \Big( \sum_{j=1}^{N} p_j - 1 \Big),$$

where $\lambda_1, \lambda_2 \in \mathbb{R}$ are Lagrange multipliers. Since the probability vector
belongs to a compact set, $(p_1, \ldots, p_N) \in [0,1] \times \cdots \times [0,1]$, and $F$ is a
continuous function, a well-known result states the existence of the maximum of $F$
on the aforementioned hypercube.² Since each state is taken with a positive
probability, it makes sense to search for the maxima in the interior of the
domain. Therefore, we can use the variational equations

$$\frac{\partial F}{\partial p_j} = -(1 + \ln p_j) + \lambda_1 E_j + \lambda_2 = 0, \qquad 1 \le j \le N,$$

which imply $p_j = c e^{\lambda_1 E_j}$, with $c = e^{\lambda_2 - 1}$. This is indeed a maximum since the
Hessian matrix

$$\frac{\partial^2 F}{\partial p_i \partial p_j} = \begin{pmatrix} -\frac{1}{p_1} & 0 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & -\frac{1}{p_N} \end{pmatrix}$$


2 The theorem states that a continuous function defined on a compact set is bounded 
and reaches its minima and maxima. 










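is negative definite. To show the uniqueness of this distribution it suffices to
show the uniqueness of the Lagrange multipliers $\lambda_1, \lambda_2$. We shall show there
is only one pair of numbers $(\lambda_1, c)$ that satisfies the system of constraints
(20.2.1)-(20.2.2). This can be written equivalently as the following nonlinear
system:

$$G_1(c, \lambda_1) = 1$$

$$G_2(c, \lambda_1) = k,$$

where $G_1(c, \lambda_1) = \sum_{j=1}^{N} c e^{\lambda_1 E_j}$ and $G_2(c, \lambda_1) = \sum_{j=1}^{N} c e^{\lambda_1 E_j} E_j$. By the
Inverse Function Theorem, see Theorem F.1 in the Appendix, the system has
a unique solution provided its Jacobian is nonzero. Using $p_j = c e^{\lambda_1 E_j}$ and
$\sum_j p_j = 1$, the Jacobian can be evaluated as

$$\begin{vmatrix} \frac{\partial G_1}{\partial c} & \frac{\partial G_1}{\partial \lambda_1} \\ \frac{\partial G_2}{\partial c} & \frac{\partial G_2}{\partial \lambda_1} \end{vmatrix} = \begin{vmatrix} \sum_j e^{\lambda_1 E_j} & \sum_j c e^{\lambda_1 E_j} E_j \\ \sum_j e^{\lambda_1 E_j} E_j & \sum_j c e^{\lambda_1 E_j} E_j^2 \end{vmatrix} = \frac{1}{c} \Big[ \sum_{j=1}^{N} p_j E_j^2 - \Big( \sum_{j=1}^{N} p_j E_j \Big)^2 \Big] > 0,$$

which is a consequence of Cauchy's inequality

$$\Big( \sum_{j=1}^{N} p_j E_j \Big)^2 = \Big( \sum_{j=1}^{N} \sqrt{p_j} \big( \sqrt{p_j} E_j \big) \Big)^2 \le \sum_{j=1}^{N} p_j \sum_{j=1}^{N} p_j E_j^2 = \sum_{j=1}^{N} p_j E_j^2,$$

where the inequality is strict unless all energy levels are equal.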


The partition function If we define the partition function $Z = \sum_{j=1}^{N} e^{\lambda_1 E_j}$,
the solution can also be written as

$$p_j = \frac{e^{\lambda_1 E_j}}{Z}, \qquad 1 \le j \le N. \qquad (20.2.3)$$

The partition function, $Z$, can be computed as follows. Differentiating
the function $Z(\lambda_1)$ with respect to $\lambda_1$, we obtain

$$\frac{d}{d\lambda_1} Z(\lambda_1) = \sum_{j=1}^{N} e^{\lambda_1 E_j} E_j = \sum_{j=1}^{N} Z p_j E_j = kZ,$$

where we used that $\sum_{j=1}^{N} p_j E_j = k$. Therefore, $Z(\lambda_1)$ is the solution of the
differential equation

$$\frac{d}{d\lambda_1} Z = kZ, \qquad Z(0) = N,$$


















and hence, $Z = N e^{k\lambda_1}$. Solving for $\lambda_1 = -\frac{\ln(N/Z)}{k}$ and substituting into
(20.2.3) yields

$$p_j = \frac{e^{-\frac{\ln(N/Z)}{k} E_j}}{Z} = \frac{e^{-E_j/T}}{Z},$$

where the temperature $T$ is a physical measure proportional to the average
energy of the system, $k$.³ In the literature this is called either the Boltzmann
distribution or the Gibbs distribution [72] and it was introduced by
Boltzmann, [112]. This is the probability distribution that characterizes a
system in thermal equilibrium with $N$ different states and associated energies
$E_1, E_2, \ldots, E_N$. The system takes states of lower energy with higher
probability and the probability decreases exponentially with the increase in the
energy levels.
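As a quick numerical illustration of these properties, the sketch below computes $p_j = e^{-E_j/T}/Z$ for a hypothetical three-level system and compares its entropy with that of a nearby distribution having the same normalization and the same average energy. The function names, the energies, and the perturbation are assumptions made for the example.

```python
import math

def boltzmann(energies, T=1.0):
    """Probabilities p_j = exp(-E_j / T) / Z for a list of energy levels."""
    shift = min(energies)  # subtracting a constant leaves the probabilities unchanged
    weights = [math.exp(-(e - shift) / T) for e in energies]
    z = sum(weights)
    return [v / z for v in weights]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

energies = [1.0, 2.0, 3.0]
p = boltzmann(energies, T=1.0)

# The direction (1, -2, 1) changes neither sum(p) nor the average energy
# for these equally spaced levels, so q satisfies the same two constraints.
q = [p[0] + 0.02, p[1] - 0.04, p[2] + 0.02]

print(p)
print(entropy(p), entropy(q))
```

Lower energy levels receive higher probability, and the perturbed distribution `q` has smaller entropy than `p`, as expected from the maximum entropy characterization. For very large `T` the distribution approaches the uniform one.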


20.3 Boltzmann Machines 

A Boltzmann machine is a neural network made out of mutually connected
binary stochastic neurons (see section 20.1) with symmetric weights,
$w_{ij} = w_{ji}$, and $w_{ii} = 0$, which make stochastic decisions about whether to be on or
off. A Boltzmann machine with $n$ neurons defines a probability distribution
over the state space $\mathcal{X} = \{0,1\}^n$, which is parametrized by an energy
function that describes the interactions within the model. More precisely, we
have:

Definition 20.3.1 A Boltzmann machine is a set of $n$ noisy neurons with
states $x_1, \ldots, x_n$, which form a network with symmetric weights. The state
of the $i$th neuron is updated stochastically according to the rule

$$x_i = \begin{cases} 1, & \text{with probability } p_i \\ 0, & \text{with probability } 1 - p_i, \end{cases}$$

where

$$p_i = \sigma_{1/T}\Big( \sum_j w_{ji} x_j + b_i \Big) = \frac{1}{1 + e^{-(\sum_j w_{ji} x_j + b_i)/T}}. \qquad (20.3.4)$$

The constant $T > 0$ denotes the temperature and $w_{ij} = w_{ji}$ are the weights
between the $i$th and $j$th neurons, with self-recurrent connection $w_{ii} = 0$. The
constants $b_j$ denote the bias of the $j$th neuron.


3 The version used in Physics includes a Boltzmann constant, $\kappa > 0$, which we considered
equal to 1. The distribution actually writes as $p(x) = \frac{1}{Z} e^{-E(x)/(\kappa T)}$.










Moreover, the probability that the $i$th neuron remains inactive is

$$1 - p_i = \sigma_{1/T}\Big( -\sum_j w_{ji} x_j - b_i \Big) = \frac{1}{1 + e^{(\sum_j w_{ji} x_j + b_i)/T}},$$

where we used the complementarity property of the sigmoid function,
$\sigma(x) + \sigma(-x) = 1$.

The state of a Boltzmann machine is described by the vector $x = (x_1, \ldots, x_n)^T$.
Since each neuron takes binary values, $x_i \in \{0,1\}$, it follows that there are
$N = 2^n$ states that a Boltzmann machine can take.

Boltzmann machine as a thermodynamical system We shall introduce
first an energy function depending on the network state. The signal
potential of the $i$th neuron is defined by the action of all other neurons on itself,
including its own bias, as $u_i = -\big( \frac{1}{2} \sum_j w_{ji} x_j + b_i \big)$. The minus sign is included
for convenience (to end up with a distribution resembling Boltzmann's
distribution) and the factor $\frac{1}{2}$ is included because half of the weight $w_{ij}$ counts
for the $i$th neuron and the other half counts for the $j$th neuron. The
activations (either 0 or 1) of all other neurons contribute additively to the potential
through the synaptic strength given by the weight $w_{ij}$. The energy of the
$i$th neuron is the product between its activation, $x_i$, and the signal potential,
$e_i = u_i x_i$. The total energy of the network state $x = (x_1, \ldots, x_n)^T$ is the sum
of all individual neuron energies in that state

$$E(x) = \sum_i e_i = \sum_i u_i x_i = -\frac{1}{2} \sum_{i,j} w_{ji} x_j x_i - \sum_i b_i x_i = -\sum_{i<j} w_{ji} x_j x_i - \sum_i b_i x_i.$$

In matrix form we can write $E(x) = -\frac{1}{2} x^T w x - x^T b$. Hence, the associated
energy function is quadratic with respect to the neuron activations $x_i$.

Consider now a thermodynamical system with $N = 2^n$ states, $x_1, \ldots, x_N$.
Each state corresponds to an energy level $E_j = E(x_j)$, where $E$ is the
quadratic energy function previously defined.

The Boltzmann machine is assumed to have a certain state at time zero.
Through the stochastic update rule the entropy of the system increases until
the thermodynamical equilibrium is achieved. The corresponding equilibrium
distribution is the Boltzmann distribution $p_j = \frac{e^{-E_j/T}}{Z}$, $1 \le j \le N$, and this
is achieved regardless of the initial state of the machine.

There are many ways one can endow a network with an energy function.
However, the quadratic function introduced before seems to be the natural
choice, since it is compatible with a transition probability from
Thermodynamics, which states that the probability of transition from the state $x_i$ with
energy $E_i$ to the state $x_j$ with energy $E_j$ is given by $p_{ij} = \frac{1}{1 + e^{(E_j - E_i)/T}}$.

Assume the network updates from the state $x = (x_1, \ldots, x_k, \ldots, x_n)$ with
energy $E = E(x)$ to the state $x' = (x_1, \ldots, x_k', \ldots, x_n)$ with energy
$E' = E(x')$. By Exercise 20.10.1 the difference between the energy levels is

$$E' - E = -\Big( \sum_{i=1}^{n} w_{ki} x_i + b_k \Big)(x_k' - x_k).$$

Assume the state $x_k = 0$ updates to $x_k' = 1$. Then the difference becomes
$E' - E = -\big( \sum_{i=1}^{n} w_{ki} x_i + b_k \big)$ and the update formula (20.3.4) can be written
as

$$p = \frac{1}{1 + e^{-(\sum_{i=1}^{n} w_{ki} x_i + b_k)/T}} = \frac{1}{1 + e^{(E' - E)/T}},$$

which agrees with the aforementioned transition probability formula.

If the state $x_k = 1$ updates to $x_k' = 0$, then the difference becomes
$E' - E = \sum_{i=1}^{n} w_{ki} x_i + b_k$ and we obtain the transition probability again

$$1 - p = \frac{1}{1 + e^{(\sum_{i=1}^{n} w_{ki} x_i + b_k)/T}} = \frac{1}{1 + e^{(E' - E)/T}}.$$
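The agreement between the update rule and the transition probability can be checked on a concrete state. This is a sketch under illustrative assumptions: the weights, biases, temperature, and the names `energy` and `activation_probability` are not from the text, and neurons are indexed from zero.

```python
import math

def energy(x, w, b):
    """Quadratic energy E(x) = -sum_{i<j} w_ij x_i x_j - sum_i b_i x_i."""
    n = len(x)
    pair = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(b[i] * x[i] for i in range(n))

def activation_probability(x, k, w, b, T):
    """Probability that neuron k is switched on, following rule (20.3.4)."""
    u = sum(w[k][i] * x[i] for i in range(len(x))) + b[k]
    return 1.0 / (1.0 + math.exp(-u / T))

# Symmetric weights with zero diagonal (hypothetical values).
w = [[0.0, 0.5, -1.2],
     [0.5, 0.0, 0.8],
     [-1.2, 0.8, 0.0]]
b = [0.1, -0.3, 0.2]
T = 2.0

x = [1, 0, 1]      # neuron k = 1 is off
x_new = [1, 1, 1]  # the same state with neuron k = 1 switched on
k = 1

delta = energy(x_new, w, b) - energy(x, w, b)
p = activation_probability(x, k, w, b, T)
print(delta, p, 1.0 / (1.0 + math.exp(delta / T)))
```

Here the energy difference equals minus the total input to neuron `k`, and the two expressions for the probability coincide.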


To facilitate understanding, we shall consider an example. 


Example 20.3.2 Consider a network with three neurons, having symmetric
connections and $w_{ii} = 0$. Their signal potentials are given by

$$u_1 = -\big( \tfrac{1}{2} w_{12} x_2 + \tfrac{1}{2} w_{13} x_3 + b_1 \big)$$

$$u_2 = -\big( \tfrac{1}{2} w_{12} x_1 + \tfrac{1}{2} w_{23} x_3 + b_2 \big)$$

$$u_3 = -\big( \tfrac{1}{2} w_{13} x_1 + \tfrac{1}{2} w_{23} x_2 + b_3 \big).$$

Then the energy of the network is given by

$$E(x) = -(w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{23} x_2 x_3 + b_1 x_1 + b_2 x_2 + b_3 x_3).$$

There are $N = 2^3 = 8$ states of the system

$$\mathcal{X} = \{(0,0,0), (1,0,0), (0,1,0), (0,0,1), (0,1,1), (1,0,1), (1,1,0), (1,1,1)\}$$

corresponding to the following energy levels:

$$E_1 = E(0,0,0) = 0$$
$$E_2 = E(1,0,0) = -b_1$$
$$E_3 = E(0,1,0) = -b_2$$
$$E_4 = E(0,0,1) = -b_3$$
$$E_5 = E(0,1,1) = -w_{23} - b_2 - b_3$$
$$E_6 = E(1,0,1) = -w_{13} - b_1 - b_3$$
$$E_7 = E(1,1,0) = -w_{12} - b_1 - b_2$$
$$E_8 = E(1,1,1) = -w_{12} - w_{13} - w_{23} - b_1 - b_2 - b_3.$$









The Boltzmann distribution, $p_j = \frac{e^{-E_j/T}}{Z}$, for the case $T = 1$, is given by

$$p = \frac{1}{Z} \big( 1,\ e^{b_1},\ e^{b_2},\ e^{b_3},\ e^{w_{23} + b_2 + b_3},\ e^{w_{13} + b_1 + b_3},\ e^{w_{12} + b_1 + b_2},\ e^{w_{12} + w_{13} + w_{23} + b_1 + b_2 + b_3} \big),$$

where $Z$ is a normalization factor. We note that all energy levels (with the
exception of the first one) depend on the network weights and biases. Hence,
any change in the network parameters leads to a change in the energy levels,
and therefore to a change of the Boltzmann distribution.
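The eight probabilities can also be obtained by enumerating the state space directly. The following sketch does this for hypothetical parameter values; `boltzmann_machine_distribution` is an illustrative name, and neurons are indexed from zero, so `w[0][1]` plays the role of $w_{12}$.

```python
import itertools
import math

def energy(x, w, b):
    """E(x) = -sum_{i<j} w_ij x_i x_j - sum_i b_i x_i."""
    n = len(x)
    pair = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(b[i] * x[i] for i in range(n))

def boltzmann_machine_distribution(w, b, T=1.0):
    """Equilibrium probability of every state in {0, 1}^n, by enumeration."""
    states = list(itertools.product((0, 1), repeat=len(b)))
    weights = [math.exp(-energy(x, w, b) / T) for x in states]
    z = sum(weights)
    return {x: v / z for x, v in zip(states, weights)}

# Hypothetical parameters: w12 = 0.5, w13 = -1.0, w23 = 0.3.
w = [[0.0, 0.5, -1.0],
     [0.5, 0.0, 0.3],
     [-1.0, 0.3, 0.0]]
b = [0.2, -0.1, 0.4]

p = boltzmann_machine_distribution(w, b)
for state, prob in p.items():
    print(state, prob)
```

Enumeration costs $2^n$ evaluations of the energy, so it is only practical for very small networks; it is used here purely as a check of the closed-form expression.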

One might be interested to learn an arbitrary distribution $q$ on the state
space $\mathcal{X}$ using a Boltzmann distribution by adjusting the network parameters.
In the present case this cannot be done exactly, since we need to learn 7
unknowns, $q_1, \ldots, q_7$ (since $q_8$ depends on the others) using only 6 parameters,
$b_1$, $b_2$, $b_3$, $w_{12}$, $w_{13}$, and $w_{23}$. The Boltzmann learning algorithm will provide
a way of learning $q$ in an approximate way by minimizing a Kullback-Leibler
divergence.


However, there are cases when the learning is exact. We shall deal with 
this in the next example. 


Example 20.3.3 (exact learning) Consider a Boltzmann machine with
$n = 2$ neurons. The $N = 2^2$ states are given by

$$\mathcal{X} = \{(0,0), (1,0), (0,1), (1,1)\}.$$

The associated energy is

$$E(x) = -w_{12} x_1 x_2 - b_1 x_1 - b_2 x_2,$$

and the energy levels on each state are given by

$$E(0,0) = 0, \quad E(1,0) = -b_1, \quad E(0,1) = -b_2, \quad E(1,1) = -w_{12} - b_1 - b_2.$$

The thermal equilibrium at $T = 1$ is realized for the Boltzmann distribution

$$p = \frac{1}{Z} \big( 1,\ e^{b_1},\ e^{b_2},\ e^{w_{12} + b_1 + b_2} \big).$$

We consider now an arbitrary distribution $q$ on $\mathcal{X}$

$$q_1 = q(0,0), \quad q_2 = q(1,0), \quad q_3 = q(0,1), \quad q_4 = q(1,1)$$





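and we shall find the exact values of parameters $w, b$ such that $p = q$, i.e.,
the Boltzmann machine learns $q$ exactly. We identify parameters as

$$\begin{array}{lcl}
p_1 = \dfrac{1}{Z} = q_1 & \Longrightarrow & Z = 1/q_1 \\[2mm]
p_2 = \dfrac{e^{b_1}}{Z} = q_2 & \Longrightarrow & b_1 = \ln \dfrac{q_2}{q_1} \\[2mm]
p_3 = \dfrac{e^{b_2}}{Z} = q_3 & \Longrightarrow & b_2 = \ln \dfrac{q_3}{q_1} \\[2mm]
p_4 = \dfrac{e^{w_{12} + b_1 + b_2}}{Z} = q_4 & \Longrightarrow & w_{12} = \ln \dfrac{q_1 q_4}{q_2 q_3}.
\end{array}$$

Given the distribution $q$, we can write an exact expression for the Boltzmann
distribution, which learns exactly $q$, as

$$p(x) = \frac{e^{-E(x)}}{Z} = q_1 e^{w_{12} x_1 x_2 + b_1 x_1 + b_2 x_2} = q_1 \big( e^{w_{12}} \big)^{x_1 x_2} \big( e^{b_1} \big)^{x_1} \big( e^{b_2} \big)^{x_2}$$

$$= q_1 \Big( \frac{q_1 q_4}{q_2 q_3} \Big)^{x_1 x_2} \Big( \frac{q_2}{q_1} \Big)^{x_1} \Big( \frac{q_3}{q_1} \Big)^{x_2}$$

$$= q_1^{(1 - x_1)(1 - x_2)} \, q_2^{x_1 (1 - x_2)} \, q_3^{x_2 (1 - x_1)} \, q_4^{x_1 x_2}.$$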


This exact learning of a distribution $q$ works only for $n = 2$ neurons, since in
this case the number of equations is equal to the number of variables.
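The identification of parameters for two neurons can be sketched as follows. The function names and the target distribution are illustrative assumptions; the target must have strictly positive entries for the logarithms to be defined.

```python
import math

def exact_parameters(q1, q2, q3, q4):
    """Parameters (w12, b1, b2) for which the two-neuron machine at T = 1
    reproduces q = (q(0,0), q(1,0), q(0,1), q(1,1)) exactly."""
    b1 = math.log(q2 / q1)
    b2 = math.log(q3 / q1)
    w12 = math.log(q1 * q4 / (q2 * q3))
    return w12, b1, b2

def boltzmann_two_neurons(w12, b1, b2):
    """Boltzmann distribution on the states (0,0), (1,0), (0,1), (1,1)."""
    weights = [1.0, math.exp(b1), math.exp(b2), math.exp(w12 + b1 + b2)]
    z = sum(weights)
    return [v / z for v in weights]

q = [0.1, 0.2, 0.3, 0.4]  # hypothetical target distribution
w12, b1, b2 = exact_parameters(*q)
p = boltzmann_two_neurons(w12, b1, b2)
print((w12, b1, b2), p)
```

The recovered distribution `p` coincides with `q` up to rounding, which reflects the fact that three parameters match the three independent entries of `q`.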

In the general case the number of unknown variables in the distribution $p$
is $n(n + 1)/2$ (we used the symmetry conditions $w_{ij} = w_{ji}$, $w_{ii} = 0$, and also
included $b_k$), while the number of equations is $2^n - 1$ (we subtracted 1 to
account for the linear relationship $\sum_i q_i = 1$). For $n \ge 2$ we always have
$n(n + 1)/2 \le 2^n - 1$, with equality only for the case $n = 2$.


20.4 Boltzmann Learning 


Boltzmann machines can function as approximators of distributions defined
on the state space $\mathcal{X}$, see [2]. In order for a Boltzmann machine to learn a
given distribution $q$ defined on the state space $\mathcal{X}$, we need to choose among
all distributions $p$ generated by the machine the distribution $p^*$, which is the
closest to $q$ in the sense of the Kullback-Leibler divergence, i.e.,

$$p^* = \arg\min_{p} D_{KL}(q \| p) = \arg\min_{p} \sum_{x \in \mathcal{X}} q(x) \ln \frac{q(x)}{p(x)}.$$

Since the entropy $H(q) = -\sum_{x \in \mathcal{X}} q(x) \ln q(x)$ is independent of $p$, the previous
search is equivalent to

$$p^* = \arg\max_{p} \sum_{x \in \mathcal{X}} q(x) \ln p(x).$$















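The idea is that the Boltzmann distribution, $p(x) = \frac{e^{-E(x)}}{Z}$ (for $T = 1$), changes when
the machine parameters change. The parameters are updated following a
gradient ascent method. Using the chain rule, we find first the derivative of the
log-likelihood function

$$\frac{\partial}{\partial w_{ij}} \ln p(x) = -\frac{\partial}{\partial w_{ij}} E(x) - \frac{\partial}{\partial w_{ij}} \ln Z$$

$$= \frac{\partial}{\partial w_{ij}} \Big( \frac{1}{2} x^T w x + x^T b \Big) - \frac{1}{Z} \frac{\partial Z}{\partial w_{ij}}$$

$$= x_i x_j - \frac{1}{Z} \sum_{x' \in \mathcal{X}} \frac{\partial}{\partial w_{ij}} e^{-E(x')}$$

$$= x_i x_j - \sum_{x' \in \mathcal{X}} p(x') x_i' x_j'$$

$$= x_i x_j - \mathbb{E}^p[x_i x_j].$$

Similarly, we have

$$\frac{\partial}{\partial b_j} \ln p(x) = -\frac{\partial}{\partial b_j} E(x) - \frac{\partial}{\partial b_j} \ln Z$$

$$= x_j - \sum_{x' \in \mathcal{X}} p(x') x_j'$$

$$= x_j - \mathbb{E}^p[x_j].$$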


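The gradient components of the cost function $C(w, b) = \sum_{x \in \mathcal{X}} q(x) \ln p(x)$ are
computed now as

$$\frac{\partial}{\partial w_{ij}} C(w, b) = \sum_{x \in \mathcal{X}} q(x) \frac{\partial}{\partial w_{ij}} \ln p(x)$$

$$= \sum_{x \in \mathcal{X}} q(x) x_i x_j - \sum_{x \in \mathcal{X}} q(x) \mathbb{E}^p[x_i x_j]$$

$$= \mathbb{E}^q[x_i x_j] - \mathbb{E}^p[x_i x_j].$$

$$\frac{\partial}{\partial b_j} C(w, b) = \sum_{x \in \mathcal{X}} q(x) \frac{\partial}{\partial b_j} \ln p(x)$$

$$= \sum_{x \in \mathcal{X}} q(x) x_j - \sum_{x \in \mathcal{X}} q(x) \mathbb{E}^p[x_j]$$

$$= \mathbb{E}^q[x_j] - \mathbb{E}^p[x_j].$$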





The learning follows the adjustment rule

$$\Delta w_{ij} = \eta \frac{\partial}{\partial w_{ij}} C(w, b)$$

$$\Delta b_j = \eta \frac{\partial}{\partial b_j} C(w, b),$$

which recovers the learning rule obtained in [2]

$$\Delta w_{ij} = \eta \big( \mathbb{E}^q[x_i x_j] - \mathbb{E}^p[x_i x_j] \big)$$

$$\Delta b_j = \eta \big( \mathbb{E}^q[x_j] - \mathbb{E}^p[x_j] \big),$$

where $\eta > 0$ is the learning rate. It is worth noting that the learning rule has
two phases: in the first phase the connection weight $w_{ij}$ is increased by the
average activation of $x_i$ and $x_j$ under the given distribution, $q$; in the second
phase, the connection weight $w_{ij}$ is decreased by the average activation of $x_i$
and $x_j$ under the Boltzmann distribution, $p$.
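The learning rule can be tried on a small machine where both expectations are computed exactly by enumerating the state space. This is a sketch under stated assumptions: $T = 1$, a hypothetical target distribution on three neurons, and illustrative names `distribution`, `kl_divergence`, and `boltzmann_learning`. In practice the expectation under $p$ is estimated by sampling rather than by enumeration.

```python
import itertools
import math

def distribution(w, b):
    """States of {0, 1}^n and their Boltzmann probabilities at T = 1."""
    n = len(b)
    states = list(itertools.product((0, 1), repeat=n))
    weights = []
    for x in states:
        pair = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(pair + sum(b[i] * x[i] for i in range(n))))
    z = sum(weights)
    return states, [v / z for v in weights]

def kl_divergence(q, p):
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

def boltzmann_learning(q, n, eta=0.5, steps=500):
    """Gradient ascent with Delta w_ij = eta (E^q[x_i x_j] - E^p[x_i x_j])."""
    w = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for _ in range(steps):
        states, p = distribution(w, b)
        diff = [qx - px for qx, px in zip(q, p)]
        grad_b = [sum(d * x[i] for d, x in zip(diff, states)) for i in range(n)]
        grad_w = {(i, j): sum(d * x[i] * x[j] for d, x in zip(diff, states))
                  for i in range(n) for j in range(i + 1, n)}
        for i in range(n):
            b[i] += eta * grad_b[i]
        for (i, j), g in grad_w.items():
            w[i][j] += eta * g
            w[j][i] = w[i][j]  # keep the weights symmetric
    return w, b

# Target distribution, listed in the order produced by itertools.product.
q = [0.05, 0.10, 0.05, 0.20, 0.10, 0.05, 0.15, 0.30]
w_fit, b_fit = boltzmann_learning(q, n=3)
_, p_initial = distribution([[0.0] * 3 for _ in range(3)], [0.0] * 3)
_, p_fit = distribution(w_fit, b_fit)
print(kl_divergence(q, p_initial), kl_divergence(q, p_fit))
```

With three neurons there are 6 parameters for 7 independent probabilities, so the divergence decreases but is not expected to reach zero for a generic target.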


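Remark 20.4.1 Changes in weights and biases lead to perturbations in the
Boltzmann distribution. Its sensitivity with respect to parameters is given by

$$dp(x) = p(x) \, d\ln p(x)$$

$$= p(x) \sum_{i,j} \frac{\partial \ln p(x)}{\partial w_{ij}} dw_{ij} + p(x) \sum_{j} \frac{\partial \ln p(x)}{\partial b_j} db_j$$

$$= p(x) \sum_{i,j} \big( x_i x_j - \mathbb{E}^p[x_i x_j] \big) dw_{ij} + p(x) \sum_{j} \big( x_j - \mathbb{E}^p[x_j] \big) db_j.$$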


20.5 Computing the Boltzmann Distribution 

The Boltzmann distribution is an equilibrium distribution, i.e., regardless of
the initial state choice, the machine settles in the long run to the same
distribution. In this section we shall directly compute the Boltzmann distribution
for a Boltzmann machine having two neurons, using a limiting procedure.
Starting from an arbitrary initial state, $x^0 = (x_1^0, x_2^0)$, and adjusting the state
according to the updating rule (20.3.4) of a Boltzmann machine, we obtain
a sequence of states $x^n = (x_1^n, x_2^n)$ whose distribution will converge to the
Boltzmann distribution. Thus, it suffices to compute the distribution of the
sequence of states $x^n$ and then take $n \to \infty$ to obtain the equilibrium
distribution. This procedure does not use the results introduced in section 20.2.

In this model we have 3 parameters, $b_1$, $b_2$, and $w = w_{12} = w_{21}$. Let
$a_n = P(x_2^n = 1)$. In a first stage we shall find a recurrence relation for the
sequence $a_n$ and also find its limit.












The state of the second neuron at the $(n + 1)$th step, $x_2^{n+1}$, depends on
the state of the first neuron at the $(n + 1)$th step, as in the following:

$$x_2^{n+1} = \begin{cases} 1, & \text{with probability } \sigma(w x_1^{n+1} + b_2) \\ 0, & \text{with probability } 1 - \sigma(w x_1^{n+1} + b_2). \end{cases}$$

Using the probability chain rule⁴ we can express the probability of the second
neuron state in terms of the first one using conditional probabilities as

$$P(x_2^{n+1} = 1) = P(x_2^{n+1} = 1 | x_1^{n+1} = 1) P(x_1^{n+1} = 1) + P(x_2^{n+1} = 1 | x_1^{n+1} = 0) P(x_1^{n+1} = 0)$$

$$= \sigma(w + b_2) P(x_1^{n+1} = 1) + \sigma(b_2) P(x_1^{n+1} = 0). \qquad (20.5.5)$$

Now, the state $x_1^{n+1}$ depends on $x_2^n$ as

$$x_1^{n+1} = \begin{cases} 1, & \text{with probability } \sigma(w x_2^n + b_1) \\ 0, & \text{with probability } 1 - \sigma(w x_2^n + b_1). \end{cases}$$

The probability chain rule yields

$$P(x_1^{n+1} = 1) = P(x_1^{n+1} = 1 | x_2^n = 1) P(x_2^n = 1) + P(x_1^{n+1} = 1 | x_2^n = 0) P(x_2^n = 0)$$

$$= \sigma(w + b_1) P(x_2^n = 1) + \sigma(b_1) P(x_2^n = 0). \qquad (20.5.6)$$


Substituting (20.5.6) into (20.5.5) yields the first-order recurrence

$$a_{n+1} = \alpha a_n + \beta, \qquad (20.5.7)$$

with

$$\alpha = \big( \sigma(w + b_1) - \sigma(b_1) \big) \big( \sigma(w + b_2) - \sigma(b_2) \big)$$

$$\beta = \sigma(w + b_2) \sigma(b_1) + \sigma(b_2) \sigma(-b_1).$$

Since $\sigma$ is increasing, it follows that $\alpha > 0$ for $w \neq 0$. Using the Mean Value
Theorem we can estimate the following upper bound:

$$\alpha = \sigma'(c_1) w \, \sigma'(c_2) w \le \|\sigma'\|_\infty^2 w^2 = \frac{w^2}{16},$$

where we used that the largest slope of $\sigma$ is $1/4$. Since $\sigma \in (0,1)$, we have
$0 < \beta < 2$.


4 For any two events $A$ and $B$ we have $P(A) = P(A, B) + P(A, B^c) = P(A|B)P(B) + P(A|B^c)P(B^c)$.












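The recurrence (20.5.7) can be solved inductively as

$$a_{n+1} = \alpha^{n+1} a_0 + \beta (1 + \alpha + \cdots + \alpha^n)$$

$$= \alpha^{n+1} a_0 + \beta \frac{1 - \alpha^{n+1}}{1 - \alpha}.$$

Assume $|w| < 4$ (which implies the stability condition). Then $0 < \alpha < 1$, and
hence, $\alpha^n \to 0$, as $n \to \infty$. Therefore $a_n \to \frac{\beta}{1 - \alpha}$, or

$$\lim_{n \to \infty} P(x_2^n = 1) = \frac{\beta}{1 - \alpha}.$$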


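The next goal is to find the following equilibrium distribution:

$$p(0,0) = \lim_{n \to \infty} P(x_1^n = 0, x_2^n = 0)$$
$$p(0,1) = \lim_{n \to \infty} P(x_1^n = 0, x_2^n = 1)$$
$$p(1,0) = \lim_{n \to \infty} P(x_1^n = 1, x_2^n = 0)$$
$$p(1,1) = \lim_{n \to \infty} P(x_1^n = 1, x_2^n = 1).$$

Taking the limit $n \to \infty$ in the following conditional probability relations

$$P(x_1^{n+1} = 0, x_2^n = 0) = P(x_1^{n+1} = 0 | x_2^n = 0) P(x_2^n = 0) = \sigma(-b_1) P(x_2^n = 0)$$
$$P(x_1^{n+1} = 0, x_2^n = 1) = P(x_1^{n+1} = 0 | x_2^n = 1) P(x_2^n = 1) = \sigma(-w - b_1) P(x_2^n = 1)$$
$$P(x_1^{n+1} = 1, x_2^n = 0) = P(x_1^{n+1} = 1 | x_2^n = 0) P(x_2^n = 0) = \sigma(b_1) P(x_2^n = 0)$$
$$P(x_1^{n+1} = 1, x_2^n = 1) = P(x_1^{n+1} = 1 | x_2^n = 1) P(x_2^n = 1) = \sigma(w + b_1) P(x_2^n = 1)$$

provides the equilibrium distribution

$$p(0,0) = \sigma(-b_1) \Big( 1 - \frac{\beta}{1 - \alpha} \Big)$$
$$p(0,1) = \sigma(-w - b_1) \frac{\beta}{1 - \alpha}$$
$$p(1,0) = \sigma(b_1) \Big( 1 - \frac{\beta}{1 - \alpha} \Big)$$
$$p(1,1) = \sigma(w + b_1) \frac{\beta}{1 - \alpha}.$$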
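The limiting distribution can be compared numerically with the Boltzmann distribution $e^{-E(x)}/Z$ of the two-neuron machine at $T = 1$. The sketch below uses hypothetical parameter values with $|w| < 4$; the function names are illustrative.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def recurrence_coefficients(w, b1, b2):
    """Coefficients alpha and beta of the recurrence a_{n+1} = alpha a_n + beta."""
    alpha = (sigmoid(w + b1) - sigmoid(b1)) * (sigmoid(w + b2) - sigmoid(b2))
    beta = sigmoid(w + b2) * sigmoid(b1) + sigmoid(b2) * sigmoid(-b1)
    return alpha, beta

def equilibrium_from_limit(w, b1, b2):
    """Equilibrium distribution obtained from the limit beta / (1 - alpha)."""
    alpha, beta = recurrence_coefficients(w, b1, b2)
    a = beta / (1.0 - alpha)
    return {(0, 0): sigmoid(-b1) * (1.0 - a), (0, 1): sigmoid(-w - b1) * a,
            (1, 0): sigmoid(b1) * (1.0 - a), (1, 1): sigmoid(w + b1) * a}

def boltzmann_two_neurons(w, b1, b2):
    """Boltzmann distribution with E(x) = -w x1 x2 - b1 x1 - b2 x2 and T = 1."""
    weights = {(x1, x2): math.exp(w * x1 * x2 + b1 * x1 + b2 * x2)
               for x1 in (0, 1) for x2 in (0, 1)}
    z = sum(weights.values())
    return {state: v / z for state, v in weights.items()}

w, b1, b2 = 1.5, -0.4, 0.7  # hypothetical parameters
alpha, beta = recurrence_coefficients(w, b1, b2)

a = 0.9  # arbitrary initial value a_0 = P(x_2^0 = 1)
for _ in range(200):
    a = alpha * a + beta

p_limit = equilibrium_from_limit(w, b1, b2)
p_boltzmann = boltzmann_two_neurons(w, b1, b2)
print(a, beta / (1.0 - alpha))
for state in sorted(p_limit):
    print(state, p_limit[state], p_boltzmann[state])
```

For these values the iterated sequence settles at $\beta/(1-\alpha)$ and the two distributions agree to rounding error, which is consistent with the claim that the limiting procedure recovers the Boltzmann distribution.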


We shall present next two ways of assessing information on a Boltzmann 
manifold, using the entropy and evaluating the Fisher information. 

















20.6 Entropy of Boltzmann Distribution

The Boltzmann distribution is the distribution with the largest entropy on 
the state space T, given a fixed value of the average energy E p [E(x)] — k. 
The entropy can be computed as in the following: 


h{jp) = - ^ p( x ) in p( x ) = - E p ( x ) ln 


xG V 


xG V 


_S(x) 

e t 

Z 


T 


E p( x ) E ( x ) + E p^ ln z 


x£zX 




= t E P[ J B(x)]+lnZ=|+lnZ. 

Since in Physics the entropy is determined up to an additive constant, the 
constant term, ln Z can be neglected. Then the entropy becomes the quotient 
between the average System energy, E p [E(x)], and the temperature, T. 

Let $p_{unif}$ denote the uniform distribution on $\mathcal{X}$, i.e., $p_{unif}(x) = \frac{1}{N}$. Its entropy

$$H(p_{unif}) = -\sum_{x} \frac{1}{N} \ln \frac{1}{N} = \ln N$$

is the largest among all entropies of distributions on $\mathcal{X}$. Since

$$D_{KL}(p \,\|\, p_{unif}) = -H(p) - \sum_{x} p(x) \ln \frac{1}{N} = \ln N - H(p),$$

it follows that the difference between the largest possible entropy and the entropy of the Boltzmann distribution is given by

$$H(p_{unif}) - H(p) = D_{KL}(p \,\|\, p_{unif}) \geq 0.$$

The left term represents the reduction of the maximum entropy due to the constraint on the average energy, $\mathbb{E}^p[E(x)] = k$. The right term shows that this is given by the Kullback-Leibler divergence.
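
The two entropy identities of this section can be checked numerically on a small machine. The following is only a sketch, assuming the energy $E(x) = -\frac{1}{2}x^T w x - b^T x$ used later in the chapter; the helper name `boltzmann`, the random parameters and the temperature value are illustrative choices, not taken from the text.

```python
import itertools
import numpy as np

def boltzmann(w, b, T=1.0):
    """Enumerate all binary states and return (energies, probabilities, Z).

    Assumes the energy E(x) = -1/2 x^T w x - b^T x (hypothetical helper).
    """
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    energies = -0.5 * np.einsum("si,ij,sj->s", states, w, states) - states @ b
    weights = np.exp(-energies / T)
    Z = weights.sum()
    return energies, weights / Z, Z

rng = np.random.default_rng(0)
n, T = 3, 0.7
w = rng.normal(size=(n, n))
w = (w + w.T) / 2            # symmetric weights
np.fill_diagonal(w, 0.0)     # no self-connections
b = rng.normal(size=n)

E, p, Z = boltzmann(w, b, T)
N = len(p)                               # number of states, 2**n
H = -np.sum(p * np.log(p))               # entropy H(p)
k = np.sum(p * E)                        # average energy E^p[E(x)]
kl_to_uniform = np.sum(p * np.log(p * N))  # D_KL(p || p_unif)

print(H, k / T + np.log(Z))              # the two values should agree
print(kl_to_uniform, np.log(N) - H)      # the two values should agree
```

The state space is enumerated explicitly, so this approach is only practical for a handful of neurons.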


20.7 Fisher Information 

Any Boltzmann machine defines a probability distribution of the form $p(x) = \frac{e^{-E(x)/T}}{Z}$, $x \in \mathcal{X}$. Conversely, any distribution of this form defines uniquely a Boltzmann machine. Thus, the family of Boltzmann machines can be identified with the family of Boltzmann distributions, and hence can be parametrized by $w_{ij}$ and $b_k$. The Riemannian metric on the associated manifold having coordinates $(w_{ij}, b_k)$ is given by the Fisher metric. Using the computation from section 20.4


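$$\begin{aligned}
\partial_{w_{ij}} \ln p(x) &= x_i x_j - \mathbb{E}^p[x_i x_j]\\
\partial_{w_{kl}} \ln p(x) &= x_k x_l - \mathbb{E}^p[x_k x_l]\\
\partial_{b_k} \ln p(x) &= x_k - \mathbb{E}^p[x_k]\\
\partial_{b_l} \ln p(x) &= x_l - \mathbb{E}^p[x_l],
\end{aligned}$$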


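the linearity of the expectation operator yields

$$\begin{aligned}
g_{ij,kl}(w,b) &= \mathbb{E}^p[\partial_{w_{ij}} \ln p(x)\, \partial_{w_{kl}} \ln p(x)]\\
&= \mathbb{E}^p[x_i x_j x_k x_l] - \mathbb{E}^p[x_i x_j]\,\mathbb{E}^p[x_k x_l]\\
&= Cov(x_i x_j,\, x_k x_l),\\
g_{k,r}(w,b) &= \mathbb{E}^p[\partial_{b_k} \ln p(x)\, \partial_{b_r} \ln p(x)]\\
&= \mathbb{E}^p[x_k x_r] - \mathbb{E}^p[x_k]\,\mathbb{E}^p[x_r]\\
&= Cov(x_k,\, x_r).
\end{aligned}$$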
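
As an illustration of the covariance formulas above, the two diagonal blocks of the Fisher matrix can be evaluated by brute-force enumeration of the states. This is a minimal sketch under the assumption $E(x) = -\frac{1}{2}x^T w x - b^T x$; the function name `fisher_blocks` and the two-neuron parameters are hypothetical, chosen only for the example.

```python
import itertools
import numpy as np

def fisher_blocks(w, b, T=1.0):
    """Return (g_ww, g_bb, p): covariance blocks Cov(x_i x_j, x_k x_l) and
    Cov(x_k, x_r) under the Boltzmann distribution p (hypothetical helper)."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    energies = -0.5 * np.einsum("si,ij,sj->s", states, w, states) - states @ b
    p = np.exp(-energies / T)
    p /= p.sum()

    # One feature x_i x_j per unordered pair i < j.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pair_feats = np.array([[s[i] * s[j] for (i, j) in pairs] for s in states])

    def cov(features):
        centered = features - p @ features
        return centered.T @ (centered * p[:, None])

    return cov(pair_feats), cov(states), p

# Two-neuron machine with one weight and two biases (illustrative values).
w = np.array([[0.0, 0.5], [0.5, 0.0]])
b = np.array([0.1, -0.3])
g_ww, g_bb, p = fisher_blocks(w, b)
print(g_ww)   # 1 x 1 block: variance of x_1 x_2
print(g_bb)   # 2 x 2 block: covariance matrix of (x_1, x_2)
```

For two neurons the product $x_1 x_2$ is a Bernoulli variable, so the weight block reduces to $m(1-m)$ with $m = p(1,1)$.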


We note that the Fisher information depends on the correlations of neuron activations $x_j$ and is independent of the weights and biases, i.e., of the neural manifold coordinates.⁵ This implies that the derivatives of the metric coefficients vanish

$$\frac{\partial g_{ij,kl}(w,b)}{\partial w_{ik}} = \frac{\partial g_{ij,kl}(w,b)}{\partial b_r} = \frac{\partial g_{k,r}(w,b)}{\partial w_{ij}} = \frac{\partial g_{k,r}(w,b)}{\partial b_j} = 0.$$

Consequently, all Christoffel symbols (13.1.2) vanish. Then the associated manifold is intrinsically flat (the Riemannian curvature tensor is zero). The geodesics equations (13.1.1) on this manifold become $\ddot{c}^{\alpha}(t) = 0$, which means that the geodesic components $c^{\alpha}(t)$ are affine functions in $t$. Since the distance between the initial point and the optimal point on the manifold cannot be smaller than the length of the geodesic, the previous relation provides a lower bound on how fast learning can be accomplished on this manifold. This situation is similar to the geometry of the neuromanifold associated with a linear neuron described in section 14.4.


5 Something similar happens with the Euclidean metric in a Euclidean space, $\mathbb{R}^n$. These types of metrics are called translation invariant.









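For efficiency reasons, the Fisher matrix can be used in combination with the natural gradient learning algorithm introduced in section 14.8. The updating rule in this case becomes

$$\begin{aligned}
\Delta w_{ij} &= \eta \sum_{k,l} g^{ij,kl}\, \frac{\partial}{\partial w_{kl}} C(w,b) = \eta \sum_{k,l} g^{ij,kl} \big( \mathbb{E}^q[x_k x_l] - \mathbb{E}^p[x_k x_l] \big)\\
&= \eta \Big( \mathbb{E}^q\Big[\sum_{k,l} g^{ij,kl} x_k x_l\Big] - \mathbb{E}^p\Big[\sum_{k,l} g^{ij,kl} x_k x_l\Big] \Big),\\
\Delta b_k &= \eta \sum_{l} g^{kl}\, \frac{\partial}{\partial b_l} C(w,b) = \eta \sum_{l} g^{kl} \big( \mathbb{E}^q[x_l] - \mathbb{E}^p[x_l] \big)\\
&= \eta \Big( \mathbb{E}^q\Big[\sum_{l} g^{kl} x_l\Big] - \mathbb{E}^p\Big[\sum_{l} g^{kl} x_l\Big] \Big).
\end{aligned}$$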


20.8 Applications of Boltzmann Machines 


Boltzmann machines are mainly used for solving combinatorial optimization 
problems and carrying out learning tasks. 

1. Distributions approximator As we have seen in section 20.4, a Boltzmann machine has the capability of learning any discrete distribution, $q(x)$, on the state space $\mathcal{X}$ using a distribution of the exponential form

$$p(x) = \frac{e^{(x^T w x + b^T x)/T}}{Z}. \qquad (20.8.8)$$

In order to refine this result, we approximate the distribution $q(x)$ by a convex combination of distributions of type (20.8.8)

$$q(x) \approx \sum_{i=1}^{m} \alpha_i q_i(x),$$

with $q_i(x) = \frac{1}{Z_i} e^{(x^T w(i) x + b(i)^T x)/T}$. The Boltzmann machine $B(i)$ defined by coordinates $(w(i), b(i))$ learns the distribution $q_i(x)$. Consider now a neural network which combines the Boltzmann machines $B(i)$. Its output, $y(x) = \sum_{i=1}^{m} \alpha_i q_i(x)$, is an approximation of the distribution $q(x)$.


Remark 20.8.1 (i) It has been shown that Boltzmann machines with hidden units are universal approximators of probability mass functions over discrete variables, see [103].

(ii) A similar approximation result holds for the continuous case. The class of combinations of exponentials is well known to be dense in the set of distributions on $(0, \infty)$. Interested readers are referred to [35].
















2. Simulated annealing method A Boltzmann machine is a network of mutually connected binary stochastic neurons. When the temperature parameter $T \searrow 0$, each stochastic neuron becomes a regular perceptron, see section 20.1. Hence, the Boltzmann machine tends to a neural network made out of perceptrons, which is called a Hopfield network. We shall briefly present this type of network in the following.

Hopfield networks This type of network was introduced by the physicist John Hopfield⁶ in 1982, see [57]. The physical importance consists of the fact that a Hopfield network is isomorphic to the Ising model of magnetism at zero temperature, see [61].

A Hopfield network consists of $n$ perceptrons⁷ that preserve their individual states until they are selected for a new update, which is made at random. The perceptrons are totally coupled, that is, each neuron is connected with all the others, except itself.⁸ The weights between the $i$th and the $j$th neuron, $w_{ij}$, are symmetric, and can be modeled by a symmetric matrix $w$ with zero diagonal entries. An example of a Hopfield network with $n = 6$ neurons is shown in Fig. 20.1 a.

The network starts with an initial state, $x^0 = (x_1^0, \dots, x_n^0) \in \{0,1\}^n$, and the updates occur one at a time (i.e., they are asynchronous). Assume the $j$th neuron is chosen for an update. The effect of all the other neurons on the $j$th neuron, including its bias, is $\sum_{i \neq j} w_{ij} x_i + b_j$. The value of the neuron is updated to the value of the Heaviside function $H\big(\sum_{i \neq j} w_{ij} x_i + b_j\big)$, which is either 0 or 1.

The energy associated with a Hopfield network evaluated at the state $x$ is the same as the one in case of a Boltzmann machine

$$E(x) = -\frac{1}{2} x^T w x - x^T b = -\frac{1}{2} \sum_{i,j} w_{ij} x_i x_j - \sum_{k} b_k x_k = -\sum_{i<j} w_{ij} x_i x_j - \sum_{k} b_k x_k.$$


The task of a Hopfield network is to minimize the aforementioned energy by the updating procedure. The convergence of the network to a stable state, which corresponds to a local minimum of the energy function, is shown in the following. Consider the update from the state $x = (x_1, \dots, x_k, \dots, x_n)$ to the new state $x' = (x_1, \dots, x'_k, \dots, x_n)$. By Exercise 20.10.1 the energy difference is

$$E(x') - E(x) = -\Big( \sum_{i=1}^{n} w_{ki} x_i + b_k \Big) (x'_k - x_k).$$

6 Professor at Princeton University since 1964.

7 In our case the perceptrons take values 1 and 0, while in Hopfield's approach they take values 1 and −1, because his model was derived from a physical model where particles have a spin that is either "up" or "down".

8 This avoids a permanent feedback of its own state value.


We consider two cases:

(i) If the $k$th neuron updates its value from $x_k = 0$ to $x'_k = 1$, then we have $H\big(\sum_{i=1}^{n} w_{ki} x_i + b_k\big) = 1$, which implies $\sum_{i=1}^{n} w_{ki} x_i + b_k > 0$. In this case

$$E(x') - E(x) = -\Big( \sum_{i=1}^{n} w_{ki} x_i + b_k \Big)(1 - 0) < 0.$$

(ii) If the $k$th neuron updates its value from $x_k = 1$ to $x'_k = 0$, then we have $H\big(\sum_{i=1}^{n} w_{ki} x_i + b_k\big) = 0$, so $\sum_{i=1}^{n} w_{ki} x_i + b_k < 0$. Then

$$E(x') - E(x) = -\Big( \sum_{i=1}^{n} w_{ki} x_i + b_k \Big)(0 - 1) < 0.$$

Then, every time a neuron state is updated, the total energy decreases. Since there are only a finite number of states that can be taken, $2^n$, it follows that at some point the energy cannot be reduced any further, which corresponds to a state of minimum energy.

Sometimes, we arrive at a minimum that is not an absolute minimum. In order to avoid getting stuck in local minima of the energy function, we add noise to the system. This is done by transforming the Hopfield network into a Boltzmann machine and then using the simulated annealing method with $T \searrow 0$ to approach the global minimum of the energy function.
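
The descent argument above can be watched in a small simulation. The sketch below is not the book's algorithm verbatim: it assumes the convention $H(0) = 0$, visits the neurons in a random order within each sweep, and uses random symmetric weights; the names `energy` and `hopfield_run` are hypothetical.

```python
import numpy as np

def energy(x, w, b):
    """E(x) = -1/2 x^T w x - b^T x."""
    return -0.5 * x @ w @ x - b @ x

def hopfield_run(w, b, x0, rng, max_sweeps=100):
    """Asynchronous updates x_k <- H(sum_i w_ki x_i + b_k) until no neuron changes."""
    x = x0.copy()
    trace = [energy(x, w, b)]          # energy after every actual flip
    for _ in range(max_sweeps):
        changed = False
        for k in rng.permutation(len(x)):
            new = 1.0 if w[k] @ x + b[k] > 0 else 0.0   # assumes H(0) = 0
            if new != x[k]:
                x[k] = new
                changed = True
                trace.append(energy(x, w, b))
        if not changed:                # stable state reached
            break
    return x, trace

rng = np.random.default_rng(0)
n = 6
w = rng.normal(size=(n, n))
w = (w + w.T) / 2                      # symmetric weights
np.fill_diagonal(w, 0.0)               # zero diagonal: no self-connections
b = rng.normal(size=n)
x0 = rng.integers(0, 2, size=n).astype(float)

x_final, trace = hopfield_run(w, b, x0, rng)
print(x_final, trace)
```

The recorded energies form a non-increasing sequence and the returned state is left unchanged by any further update, which is the stable state discussed in the text. Whether it is a global minimum is not guaranteed.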

Example 20.8.2 (The n rooks problem) We now get back to the problem introduced in section 1.3. We make it here a little more general by asking to place $n$ rooks on an $n \times n$ chess board so that no one endangers another. This optimization problem can be solved using a Hopfield network. First, the objective function, which needs to be minimized, is the sum

$$E(x_{11}, \dots, x_{nn}) = \sum_{j=1}^{n} \Big( \sum_{i=1}^{n} x_{ij} - 1 \Big)^2 + \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} x_{ij} - 1 \Big)^2,$$

where $x_{ij}$ is the state of the $(i,j)$th square. The state is 1, if there is a rook placed at that spot and 0, otherwise. The minimum value of $E(x_{11}, \dots, x_{nn})$ is zero and this is accomplished when there is only one rook on each row and on each column. Since $x_{ij} \in \{0,1\}$, then $x_{ij}^2 = x_{ij}$. We shall show that some algebraic manipulation reduces the function $E(x_{11}, \dots, x_{nn})$ to the energy of a Hopfield network. We start expanding the square

$$\begin{aligned}
\Big( \sum_{i=1}^{n} x_{ij} - 1 \Big)^2 &= \Big( \sum_{i=1}^{n} x_{ij} \Big)^2 - 2 \sum_{i=1}^{n} x_{ij} + 1\\
&= \sum_{i=1}^{n} x_{ij}^2 + 2 \sum_{k \neq i} x_{ij} x_{kj} - 2 \sum_{i=1}^{n} x_{ij} + 1\\
&= 2 \sum_{k \neq i} x_{ij} x_{kj} - \sum_{i=1}^{n} x_{ij} + 1.
\end{aligned}$$

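The first term of $E$ is evaluated as

$$F_1 = \sum_{j=1}^{n} \Big( \sum_{i=1}^{n} x_{ij} - 1 \Big)^2 = 2 \sum_{j=1}^{n} \sum_{k \neq i} x_{ij} x_{kj} - \sum_{i,j=1}^{n} x_{ij} + n.$$

Similarly, the second term of $E$ becomes

$$F_2 = \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} x_{ij} - 1 \Big)^2 = 2 \sum_{i=1}^{n} \sum_{k \neq j} x_{ik} x_{ij} - \sum_{i,j=1}^{n} x_{ij} + n.$$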


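Then

$$\begin{aligned}
E(x_{11}, \dots, x_{nn}) &= F_1(x_{11}, \dots, x_{nn}) + F_2(x_{11}, \dots, x_{nn})\\
&= 2 \Big( \sum_{j=1}^{n} \sum_{k \neq i} x_{ij} x_{kj} + \sum_{i=1}^{n} \sum_{k \neq j} x_{ik} x_{ij} \Big) - 2 \sum_{i,j=1}^{n} x_{ij} + 2n\\
&= -\frac{1}{2} \sum_{\alpha, \beta} w_{\alpha\beta}\, x_{\alpha} x_{\beta} - \sum_{\alpha} b_{\alpha} x_{\alpha} + 2n,
\end{aligned}$$

where $b_{\alpha} = 2$ and

$$w_{\alpha\beta} = \begin{cases} -4, & \text{if } \alpha \text{ and } \beta \text{ are placed on the same row or column;}\\ 0, & \text{otherwise.} \end{cases}$$

Neglecting the irrelevant constant $2n$, the rest of the expression is the energy of a Hopfield network, see Fig. 20.1 b. The bias of each unit is 2 and the connection weights are nonzero only for horizontally and vertically connected neurons.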

Each of the $n^2$ squares of the board corresponds to a perceptron, and hence the network can take $2^{n^2}$ states. The state of a neuron is 1 if there is a rook placed in that square and 0, otherwise. Each rooks' configuration, $x = (x_{11}, \dots, x_{nn})$, corresponds to a state of the Hopfield network. The stable state of the network minimizes the energy, and hence provides a solution for the rooks' problem. The learning algorithm consists in choosing a square at random and making an update. It is worth noting that regardless of the initial state of the Hopfield network the final configuration of the problem will contain only $n$ active perceptrons, i.e., there are $n$ rooks that solve the problem.

Figure 20.1: a. A Hopfield network with $n = 6$ neurons; b. A Hopfield network associated with a $3 \times 3$ chess board has nonzero connection weights only for vertically or horizontally connected units.
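
The procedure described in this example can be sketched in a few lines. The code below uses the weights $-4$ and the bias $2$ stated in the text, with the additional assumptions that a square has no connection to itself and that squares are visited in a random order within each sweep; the function names `rooks_network` and `solve_rooks` are hypothetical.

```python
import numpy as np

def rooks_network(n):
    """Weights -4 between distinct squares sharing a row or column, bias 2."""
    squares = [(i, j) for i in range(n) for j in range(n)]
    m = n * n
    w = np.zeros((m, m))
    for a, (i, j) in enumerate(squares):
        for c, (k, l) in enumerate(squares):
            if a != c and (i == k or j == l):
                w[a, c] = -4.0
    b = np.full(m, 2.0)
    return w, b

def solve_rooks(n, rng):
    """Run asynchronous Hopfield updates from a random board until stable."""
    w, b = rooks_network(n)
    x = rng.integers(0, 2, size=n * n).astype(float)
    changed = True
    while changed:
        changed = False
        for k in rng.permutation(n * n):
            # The input 2 - 4 * (number of attacking rooks) is never zero.
            new = 1.0 if w[k] @ x + b[k] > 0 else 0.0
            if new != x[k]:
                x[k] = new
                changed = True
    return x.reshape(n, n).astype(int)

board = solve_rooks(8, np.random.default_rng(1))
print(board)
print(board.sum(axis=0), board.sum(axis=1))
```

A square switches on only when no other rook shares its row or column, and switches off otherwise, so a stable board has exactly one rook per row and per column. Since every flip lowers the energy, the loop terminates.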


Example 20.8.3 The people of a community have to vote for or against a certain candidate leader. We shall model the community behavior as a Hopfield network. The state of a member who votes for the candidate is 1. If he votes against, the state is 0. The mutual influence between the $i$th and $j$th members is given by the weight $w_{ij}$. Each individual member has his/her own threshold of belief, denoted by $-b_i$. If the input belief influence on the $i$th member from the other members is larger than his individual threshold of belief, i.e., if $\sum_j w_{ij} x_j > -b_i$, then the $i$th individual will vote for the candidate, i.e., will have associated the state 1. Otherwise, if $\sum_j w_{ij} x_j < -b_i$, he will vote against, i.e., has the value 0. In both cases, the state of the $i$th individual is given by the output of a perceptron $H\big(\sum_j w_{ij} x_j + b_i\big)$. Thus, each member can be considered as a perceptron, and the community as a Hopfield network.

In the long run the state of the network, $x$, which is a sequence of 0s and 1s, is maximizing the quadratic function $f(x) = \frac{1}{2} x^T w x + x^T b$. Hence, if one knows the mutual influence between the members, $w_{ij}$, and the individual thresholds of belief, $b_i$, one can find the state of the community. Therefore, the candidate will get a number of $\sum_i x_i$ favorable votes. If $\sum_i x_i > 1 + n/2$, then he wins the election. Here, $n$ is the size of the community.

20.9 Summary 

A binary stochastic neuron is a neuron whose output is a random variable taking values in $\{0,1\}$. The probability of taking the value 1 is given by a sigmoid function applied to the input of the neuron. A Boltzmann machine is a neural network that consists of an ensemble of symmetrically connected stochastic neurons in complex interaction with all the other neurons in the net. An energy function is introduced to harness the complexity of the model. The stable state of a Boltzmann machine is described by an equilibrium distribution parametrized by the energy function, called Boltzmann distribution. When the temperature parameter tends to 0, the Boltzmann machine becomes a Hopfield network, that is, a network made out of perceptrons. Hopfield networks are useful for solving complex combinatorial problems. To avoid falling into a local minimum, a Boltzmann machine is used instead in combination with the simulated annealing method for obtaining a global minimum of the energy function.

20.10 Exercises 


Exercise 20.10.1 Let $x = (x_1, \dots, x_k, \dots, x_n)$ and $x' = (x_1, \dots, x'_k, \dots, x_n)$ be two states of a Boltzmann machine, with corresponding energies $E = E(x)$ and $E' = E(x')$. Prove that

$$E' - E = -\Big( \sum_{i=1}^{n} w_{ki} x_i + b_k \Big)(x'_k - x_k).$$

Exercise 20.10.2 Consider all notations introduced in Example 20.3.2. If $q = (q_1, \dots, q_8)$ is a distribution on the state space $\mathcal{X}$ show that the Boltzmann machine can learn exactly the distribution $q$ if and only if

$$q_8 = \frac{q_5\, q_6\, q_7}{q_2\, q_3\, q_4}.$$

Exercise 20.10.3 The state transition matrix $P_T = (p_{ij})$ of a Boltzmann machine with $n$ neurons is an $N \times N$ matrix, with $N = 2^n$, whose elements $p_{ij}$ represent the probability of the transition from the state $j$ to the state $i$ in a single step at the temperature $T$. This is defined by

$$p_{ij} = \frac{1}{1 + e^{(E_i - E_j)/T}}, \quad \text{if } i \neq j,$$

$$p_{ii} = 1 - \sum_{j \neq i} \frac{1}{1 + e^{(E_j - E_i)/T}}.$$

(a) Prove that the Boltzmann distribution $p = \frac{1}{Z}\big(e^{-E_1/T}, \dots, e^{-E_N/T}\big)^T$ is a fixed point for $P_T$, i.e., $P_T\, p = p$ (or equivalently, $p$ is an eigenvector of $P_T$ with eigenvalue equal to 1).

(b) Show that the largest eigenvalue of $P_T$ is equal to 1.

(c) Prove that for any initial state $q_0$, the sequence $(q_n)_n$, defined recursively by $q_{n+1} = P_T\, q_n$, converges in $\mathbb{R}^N$ to the Boltzmann distribution, $p$. What is the physical significance of this?

Exercise 20.10.4 Find the explicit formula for the Fisher information matrix 
in the case of a Boltzmann machine with 2 neurons. 


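Exercise 20.10.5 Let $w_{ij}$, $b_k$ be coordinates on a Boltzmann machine and consider the linear operator⁹

$$A_{w,b} = \frac{1}{2} \sum_{i,j} w_{ij} \frac{\partial}{\partial w_{ij}} + \sum_{k} b_k \frac{\partial}{\partial b_k}.$$

(a) Show that

$$A_{w,b}\, p(x) = \big( \mathbb{E}^p[E(x)] - E(x) \big)\, p(x).$$

(b) Consider the smooth deformation $p_t(x)$ of $p(x)$ induced by the exponential transformation of coordinates $w_{ij}(t) = w_{ij} e^{at}$, $b_k(t) = b_k e^{at}$, with $a > 0$. Show that

$$\frac{d}{dt} p_t(x) = a \big( \mathbb{E}^{p_t}[E(x)] - E(x) \big)\, p_t(x).$$

(c) Consider the following evolution equation of a Boltzmann distribution

$$\begin{aligned}
\frac{d}{dt} p_t(x) &= a A_{w,b}\, p_t(x), \quad a > 0\\
p_0(x) &= p(x),
\end{aligned}$$

which corresponds to a curve $(w(t), b(t))$ in the parameters space. Find the components $w_{ij}(t)$, $b_k(t)$ of this curve.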


Exercise 20.10.6 8 distinct rooks are placed on an $8 \times 8$ chess board at random. Find the probability that all rooks are safe from one another.

9 This can be considered as a vector field on the neural manifold.











Exercise 20.10.7 (Restricted Boltzmann machines, [114]) We consider the more general situation when the neurons of a Boltzmann machine are divided into two groups, visible neurons, $v$, and hidden neurons, $h$, and there are no connections between the units of the same group (hence the name of "restricted"). Thus, the state of the machine is $x = (v, h) \in \mathcal{V} \times \mathcal{H} = \mathcal{X}$. The energy is defined by

$$E(v, h) = -v^T w h - b^T v - c^T h$$

and the joint probability of the visible and hidden neurons by

$$p(v, h) = \frac{1}{Z} e^{-E(v,h)},$$

where $Z$ is the partition function.

(a) Show that the hidden states, $h_j \in \mathcal{H}$, are conditionally independent given the visible states;

(b) Find the conditional probability $p(h_j | v)$ for $h_j \in \{0,1\}$;

(c) Compute the conditional log-likelihood function $\ell(h|v) = \ln p(h|v)$ and its partial derivatives $\partial_{b_k} \ell(h|v)$, $\partial_{c_j} \ell(h|v)$, $\partial_{w_{ij}} \ell(h|v)$;

(d) The information contained in the hidden states about the parameters, $\theta = (w, b, c)$, given the visible states is described by the Fisher information metric $g_{ij}(\theta|v) = \mathbb{E}^{p_{h|v}}[\partial_{\theta_i} \ell(h|v)\, \partial_{\theta_j} \ell(h|v)]$. Compute $g_{ij}(\theta|v)$;

(e) Let $q(h|v)$ be a given conditional probability distribution. Using a computation similar to the one done in section 20.4 provide a learning rule for the weights and biases such that $D_{KL}(q(h|v) \,\|\, p(h|v))$ is minimized.






Hints and Solutions 


We have selected a number of exercises for which we provide hints or full solutions. The reader is encouraged to try the other exercises based on the expertise acquired from the chapter examples and other similar solved exercises.

Chapter 1 

Exercise 1.9.1 (a) The problem can be modeled by a neuron as shown in Fig. 1 a. If $x \leq b$, then the factory does not produce anything, so $y = 0$. If $x > b$, then the revenue is $y = k(x - b)$, where $k$ is a positive constant related to the cost of production. Consider the following activation function, see Fig. 1 b:

$$\varphi(x) = \begin{cases} 0, & \text{if } x < 0\\ kx, & \text{otherwise.} \end{cases}$$

Then the revenue can be modeled as the composition

$$y = \varphi(x - b) = \varphi\Big( \sum_{i=1}^{n} c_i x_i - b \Big).$$

(b) The emerging learning problem is the following: Given the amounts $x_i$, what are the values of the road capacities $c_i$ to meet or be very close to a given revenue value $y$? One of the error functions subject to be minimized in this case is $\frac{1}{2}\big(y - \varphi(x - b)\big)^2$.

Exercise 1.9.2 (a) The problem is modeled by a neuron as shown in Fig. 2. The output is given by

$$y = \begin{cases} 0, & \text{if } x \leq M\\ k(x - M), & \text{if } x > M. \end{cases}$$


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3 





Figure 1: a. Suppliers send products at capacities $c_j$ to the factory $F$, which has a revenue $y$. b. The revenue function $y$ is piecewise linear.


If we consider the activation function

$$\varphi(x) = \begin{cases} 0, & \text{if } x \leq 0\\ kx, & \text{if } x > 0, \end{cases}$$

which can be seen in Fig. 1 b, then the output becomes

$$y = \varphi(x - M) = \varphi(x_1 w_1 + \cdots + x_n w_n - M).$$

(b) The learning problem can be stated now as: Adjust the investing rates such that an a priori planned profit $z$ on the fund is obtained at a prescribed time $t$.

Even if $y$ and $z$ might never be equal, an accepted answer is given as a solution of the variational problem

$$w = \arg\min \frac{1}{2}(z - y)^2 = \arg\min \frac{1}{2}\big(z - \varphi(w^T x - M)\big)^2,$$

where $w^T = (w_1, \dots, w_n)$.

Exercise 1.9.3 (a) We have $C(a) = \frac{1}{2} \int_0^1 (ax - f(x))^2\, dx$ and $C'(a) = a \int_0^1 x^2\, dx - \int_0^1 x f(x)\, dx$, and $C''(a) = \int_0^1 x^2\, dx > 0$. Then $a = 3 \int_0^1 x f(x)\, dx$ and $b = f(0)$.

(b) Let $C(a, b) = \frac{1}{2} \int_0^1 \int_0^1 (ax + by - f(x,y))^2\, dx\, dy$. Then the equation $\nabla C(a,b) = (0,0)$ becomes

$$\begin{aligned}
\frac{1}{3} a + \frac{1}{4} b &= \int_0^1 \int_0^1 x f(x,y)\, dx\, dy\\
\frac{1}{4} a + \frac{1}{3} b &= \int_0^1 \int_0^1 y f(x,y)\, dx\, dy,
\end{aligned}$$

which is a linear system with a unique solution $a$ and $b$. The last coefficient is $c = f(0,0)$.

Figure 2: Deposits at rates $w_j$ of amounts $x_j$ in certain currencies into a given fund.

Exercise 1.9.4 (a) Using Cauchy's inequality, we have



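The inequality is strict because the functions cannot be proportional.

(b) In this case we can compute explicitly

$$\begin{aligned}
\rho_{ii} &= \int_{[0,1]^n} x_i^2\, dx_1 \cdots dx_n = \int_0^1 x_i^2\, dx_i = \frac{1}{3}\\
\rho_{ij} &= \int_{[0,1]^n} x_i x_j\, dx_1 \cdots dx_n = \int_0^1 x_i\, dx_i \int_0^1 x_j\, dx_j = \frac{1}{4}, \quad i \neq j.
\end{aligned}$$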


Chapter 2 

Exercise 2.5.1 (a) The function $f : (0,1) \to \mathbb{R}$, $f(t) = -t^2 + t$ is positive and has a maximum at $t = 1/2$, which is equal to $f(1/2) = 1/4$. Then using the sigmoid property, we have $\sigma' = \sigma(1 - \sigma) = f(\sigma)$. Therefore $0 < \sigma' \leq 1/4$.

(b) Since we have

$$\sigma_c'(x) = c\,\sigma(cx)\big(1 - \sigma(cx)\big) = c f(\sigma(cx)),$$

it follows that $0 < \sigma_c'(x) \leq \frac{c}{4}$.
Exercise 2.5.2 (a) We have

$$2H(x) - 1 = 2 \begin{cases} 0, & \text{if } x < 0\\ 1, & \text{if } x \geq 0 \end{cases} - 1 = \begin{cases} -1, & \text{if } x < 0\\ 1, & \text{if } x \geq 0 \end{cases} = S(x).$$

(b) From part (a), solving for $H(x)$, yields $H(x) = \frac{1}{2}(S(x) + 1)$. Then

$$ReLU(x) = xH(x) = \frac{1}{2} x (S(x) + 1).$$

Exercise 2.5.3 (a) Using chain rule, we have

$$sp'(x) = \big(\ln(1 + e^x)\big)' = \frac{(1 + e^x)'}{1 + e^x} = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}} = \sigma(x).$$

(b) Since $sp'(x) = \sigma(x) > 0$, the function $sp(x)$ is increasing. Its inverse is obtained as

$$sp(x) = y \iff \ln(1 + e^x) = y \iff 1 + e^x = e^y \iff x = \ln(e^y - 1) \iff sp^{-1}(y) = \ln(e^y - 1).$$

(c) Differentiating in the formula $sp(x) - sp(-x) = x$ we obtain $sp'(x) + sp'(-x) = 1$. Using part (a) we get $\sigma(x) + \sigma(-x) = 1$.

Exercise 2.5.4 An algebraic computation provides

$$2\sigma(2x) - 1 = \frac{2}{1 + e^{-2x}} - 1 = \frac{2e^{2x}}{e^{2x} + 1} - 1 = \frac{e^{2x} - 1}{e^{2x} + 1} = \tanh(x).$$
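
The identities obtained in Exercises 2.5.3 and 2.5.4 can also be checked numerically. The following sketch compares both sides on a grid; the grid, the finite-difference step `h` and the helper names are arbitrary choices made for the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))        # sp(x) = ln(1 + e^x)

def softplus_inv(y):
    return np.log(np.expm1(y))        # sp^{-1}(y) = ln(e^y - 1)

x = np.linspace(-5.0, 5.0, 201)
h = 1e-5

# Central finite difference approximating sp'(x).
sp_prime = (softplus(x + h) - softplus(x - h)) / (2 * h)

print(np.max(np.abs(sp_prime - sigmoid(x))))                 # sp' = sigma
print(np.max(np.abs(softplus_inv(softplus(x)) - x)))         # inverse formula
print(np.max(np.abs(softplus(x) - softplus(-x) - x)))        # sp(x) - sp(-x) = x
print(np.max(np.abs(2 * sigmoid(2 * x) - 1 - np.tanh(x))))   # tanh identity
```

All four printed deviations should be at the level of rounding or finite-difference error.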


Exercise 2.5.5 (a) Note that $so(-x) = -so(x)$, i.e., the $so$ function is odd. Let first $x > 0$. Then $so(x) = \frac{x}{1+x} = 1 - \frac{1}{1+x}$ is increasing since $\frac{1}{1+x}$ is decreasing. Therefore, $so(x)$ is increasing on $(0, \infty)$.

Let $x_2 < x_1 < 0$. Using that $so(x)$ is an odd function and the fact that $so$ is increasing on $(0, +\infty)$, we have

$$so(x_1) - so(x_2) = so(-x_2) - so(-x_1) > 0,$$

which implies that $so$ is increasing on $(-\infty, 0)$.

(b) $so(x)$ is continuous since $so(0-) = so(0+) = so(0) = 0$. Also, we have $so(\infty) = 1$, $so(-\infty) = -1$. Therefore, $so$ maps $\mathbb{R}$ one-to-one, onto $(-1, 1)$. The inverse is computed on branches. If $y \in (0,1)$ we have $so(x) = y \iff \frac{x}{1+x} = y \implies so^{-1}(y) = \frac{y}{1-y}$. If $y \in (-1,0)$ we have $so(x) = y \iff \frac{x}{1-x} = y \iff x = \frac{y}{1+y}$. The last two expressions imply $so^{-1}(y) = \frac{y}{1 - |y|}$.

(c) Starting from $|x + y| \leq |x| + |y|$ we apply that $so$ is increasing

$$\begin{aligned}
so(|x + y|) &\leq so(|x| + |y|) = \frac{|x| + |y|}{1 + |x| + |y|} = \frac{|x|}{1 + |x| + |y|} + \frac{|y|}{1 + |x| + |y|}\\
&\leq \frac{|x|}{1 + |x|} + \frac{|y|}{1 + |y|} = so(|x|) + so(|y|).
\end{aligned}$$
































































Exercise 2.5.6 An algebraic manipulation provides

$$softmax(y + c)_i = \frac{e^{y_i + c}}{\sum_j e^{y_j + c}} = \frac{e^c \cdot e^{y_i}}{e^c \cdot \sum_j e^{y_j}} = \frac{e^{y_i}}{\sum_j e^{y_j}} = softmax(y)_i.$$
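
The invariance shown in Exercise 2.5.6 is what makes the usual numerically stable evaluation of the softmax possible: subtracting $\max_j y_j$ from all entries does not change the output. A short sketch, with an arbitrary test vector and shift:

```python
import numpy as np

def softmax(y):
    """Softmax evaluated after shifting by max(y), which leaves the result unchanged."""
    z = y - np.max(y)
    e = np.exp(z)
    return e / e.sum()

y = np.array([0.3, -1.2, 2.5, 0.0])
c = 100.0

print(softmax(y))
print(softmax(y + c))   # same values as softmax(y)
```

Without the shift, `np.exp(y + c)` would overflow for large enough `c`, while the shifted version remains finite.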


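Exercise 2.5.7 (a) $\sum_i \rho(y)_i = \sum_i \frac{y_i^2}{\|y\|_2^2} = \frac{\|y\|_2^2}{\|y\|_2^2} = 1$. (b) $\rho(\lambda y)_i = \frac{\lambda^2 y_i^2}{\lambda^2 \|y\|_2^2} = \rho(y)_i$.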

Exercise 2.5.8 The function can be written as a sum of two compact sup- 
ported functions, tp(x) — pi(x) + <P2( X ), with 


<pi( x ) = 2P + cos ^ + 

<P2( X ) = l(f,oo)(*)- 


The function <pi(x) is increasing on [—f, f] and constant in rest, while the 
function (fi 2 (x) is equal to 1 for x > 1 and eqnal to 0 in rest. It is obvious 
that <p(x) — 0 for any x < — | and (p(x) — 1 for any x > The graph of the 
function is given in Fig. 2.11 b. 


Exercise 2.5.9 (a) It follows from the fact that a squashing function is a 
sigmoidal function which is nonincreasing. 

( b ) We choose any sigmoidal function which is decreasing on some interval. 
For instance, the function 


<p(x) = xa(x)l [x0jOO )(x), 

where xq is the unique positive solution of cr(x) = 1/x, is nonincreasing and 
satisfies <£>(— 00 ) — 0 and (p(oo) — 1. 

Chapter 3 

Exercise 3.15.1 It follows from the linearity property of the integral as well 
as from the properties of the logarithmic function. 

Exercise 3.15.2 Using that $\ln x \leq x - 1$, we obtain

$$S(p, q) = -\int p(x) \ln q(x)\, dx \geq -\int p(x)(q(x) - 1)\, dx = 1 - \int p(x) q(x)\, dx.$$

Exercise 3.15.3 Each of the terms is nonnegative, $D_{KL}(p\|q) \geq 0$, $D_{KL}(q\|p) \geq 0$, reaching their minimum, which is zero, simultaneously, for $p = q$.











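Exercise 3.15.4 (a) A computation shows

$$\begin{aligned}
D_{KL}(p_1 \| p_2) &= \int_0^\infty p_1(x) \ln \frac{p_1(x)}{p_2(x)}\, dx\\
&= \ln \frac{\xi_1}{\xi_2} \int_0^\infty p_1(x)\, dx + (\xi_2 - \xi_1) \int_0^\infty x\, p_1(x)\, dx\\
&= \ln \frac{\xi_1}{\xi_2} + (\xi_2 - \xi_1) \frac{1}{\xi_1}\\
&= \frac{\xi_2}{\xi_1} - \ln \frac{\xi_2}{\xi_1} - 1.
\end{aligned}$$

(b) Let $f(x) = x - \ln x - 1$. Since $f\big(\frac{\xi_2}{\xi_1}\big) \neq f\big(\frac{\xi_1}{\xi_2}\big)$, then $D_{KL}(p_1 \| p_2) \neq D_{KL}(p_2 \| p_1)$.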

Exercise 3.15.5 Let $p_i = P(X = x_i)$, $1 \leq i \leq n$. Since $p_i \in [0,1]$, then $-\ln p_i \geq 0$, so $H(X) = -\sum_i p_i \ln p_i \geq 0$.

Exercise 3.15.6 Since $H(X) \geq 0$, see Exercise 3.15.5, then

$$D_{KL}(p\|q) = S(p,q) - H(p) \leq S(p,q).$$

Exercise 3.15.7 Since $Z$ is $\mathcal{E}$-measurable, $\mathbb{E}[Z|\mathcal{E}] = Z$. Then the error is

$$\|Z - \mathbb{E}[Z|\mathcal{E}]\| = \|Z - Z\| = 0.$$

This corresponds to an exact learning.

Exercise 3.15.8 The mapping (re, 6) —> corresponds to a hyperplane 

in $\mathbb{R}^n$. The optimal parameters, $(w^*, b^*)$, correspond to the coordinates of the orthogonal projection of the target $z$ onto the hyperplane. By geometric reasons, this projection is unique. The normal equations are linear and, hence, they can be solved explicitly for $w^*$ and $b^*$.

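Exercise 3.15.9 By L'Hospital's rule, we have

$$\begin{aligned}
\lim_{\alpha \to 1} H_\alpha(p) &= \lim_{\alpha \to 1} \frac{1}{1 - \alpha} \ln \int p^\alpha(x)\, dx = \lim_{t \to 0} \frac{\ln \int p^{1-t}(x)\, dx}{t}\\
&= \lim_{t \to 0} \frac{d}{dt} \ln \int p^{1-t}(x)\, dx = \lim_{t \to 0} \frac{\frac{d}{dt} \int p^{1-t}(x)\, dx}{\int p^{1-t}(x)\, dx}\\
&= \lim_{t \to 0} \frac{-\int p^{1-t}(x) \ln p(x)\, dx}{\int p^{1-t}(x)\, dx} = \frac{-\int p(x) \ln p(x)\, dx}{\int p(x)\, dx}\\
&= -\int p(x) \ln p(x)\, dx = H(p).
\end{aligned}$$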





















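Exercise 3.15.10 (a) It is a straightforward computation involving changes of variable and completing the square

$$\begin{aligned}
\int f_\sigma(t) f_\sigma(t - v)\, dt &= \frac{1}{2\pi\sigma^2} \int e^{-\frac{t^2}{2\sigma^2}}\, e^{-\frac{(t-v)^2}{2\sigma^2}}\, dt = \frac{1}{2\pi\sigma^2} \int e^{-\frac{t^2 - tv + v^2/2}{\sigma^2}}\, dt\\
&= \frac{1}{2\pi\sigma^2}\, e^{-\frac{v^2}{4\sigma^2}} \int e^{-\frac{(t - v/2)^2}{\sigma^2}}\, dt = \frac{1}{2\pi\sigma^2}\, e^{-\frac{v^2}{4\sigma^2}} \int e^{-\frac{u^2}{\sigma^2}}\, du\\
&= \frac{1}{2\sqrt{\pi}\,\sigma}\, e^{-\frac{v^2}{4\sigma^2}} = \frac{1}{\sqrt{2\pi}\,\sigma'}\, e^{-\frac{v^2}{2\sigma'^2}} = f_{\sigma'}(v)
\end{aligned}$$

for $\sigma' = \sigma\sqrt{2}$.

(b) A similar computation shows $f_\sigma * f_{\sigma'} = f_s$, with $s = \sqrt{\sigma^2 + \sigma'^2}$.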

Exercise 3.15.11 (a) If $D_{CS}(p,q) = 0$ then $\int pq = \sqrt{\int p^2 \int q^2}$. Since this means equality in the Schwarz inequality, the functions must be proportional. Then, there is a constant $\lambda$ such that $p(x) = \lambda q(x)$. Since $\int p = \int q = 1$, it follows that $\lambda = 1$, and hence $p = q$. The converse is obvious.

(b) By the Schwarz inequality $\int pq \leq \sqrt{\int p^2 \int q^2}$, so the argument of $\ln$ belongs to $(0, 1]$, where the logarithmic function is nonpositive.

(c) Obvious. (d) It follows from the properties of the logarithmic function and the definition of the Renyi entropy.

Exercise 3.15.12 Use the fact that $|\tanh x| \leq |x|$. This follows from $\tanh' x = 1 - \tanh^2 x \leq 1$.


Chapter 4 

Exercise 4.17.1 (a) Compute the Laplacian

$$\Delta f(x) = \frac{\partial^2 f}{\partial x_1^2} + \frac{\partial^2 f}{\partial x_2^2} = e^{x_1} \sin x_2 - e^{x_1} \sin x_2 = 0.$$

(b) Since $\nabla f(x) = (e^{x_1} \sin x_2,\, e^{x_1} \cos x_2)$, we have

$$\|\nabla f\| = e^{x_1} \|(\sin x_2, \cos x_2)\| = e^{x_1}.$$

(c) $\nabla f(x) = 0 \iff \|\nabla f\| = 0 \iff e^{x_1} = 0$, which does not have solutions.

(d) Since $f$ is harmonic (or, because $\nabla f \neq 0$), the function $f$ reaches its minima and maxima on the boundary of $[0,1] \times [0, \frac{\pi}{2}]$. Since the functions $e^{x_1}$, $x_1 \in [0,1]$ and $\sin x_2$, $x_2 \in [0, \pi/2]$ are both increasing, it follows that the maximum of $f(x)$ is reached for $(x_1, x_2) = (1, \pi/2)$ and the minimum is reached for $(x_1, x_2) = (0,0)$. Furthermore, $\min f(x) = f(0,0) = 0$ and $\max f(x) = f(1, \pi/2) = e$.


































Exercise 4.17.2 (a) $\nabla Q(x) = Ax - b$.

(b) The iteration given by the gradient descent is

$$x^{n+1} = x^n - \eta \nabla Q(x^n) = x^n - \eta(Ax^n - b) = (I - \eta A)x^n + \eta b.$$

(c) The Hessian is Hq = \A. 

(d) The iteration given by Newton's formula is

$$x^{n+1} = x^n - H_Q^{-1}(x^n) \nabla Q(x^n) = x^n - \frac{1}{2} A^{-1}(Ax^n - b) = \frac{1}{2} x^n + \frac{1}{2} A^{-1} b.$$

Assuming $x^* = \lim_{n \to \infty} x^n$, taking the limit in the previous relation yields

$$x^* = \frac{1}{2} x^* + \frac{1}{2} A^{-1} b,$$

which implies $x^* = A^{-1} b$, which was expected. The existence of the limit follows from the inductive iteration

$$\begin{aligned}
x^{n+1} &= \frac{1}{2} x^n + \frac{1}{2} A^{-1} b\\
&= \frac{1}{2} \Big( \frac{1}{2} x^{n-1} + \frac{1}{2} A^{-1} b \Big) + \frac{1}{2} A^{-1} b\\
&= \frac{1}{2^{n+1}} x^0 + \Big( \frac{1}{2^{n+1}} + \cdots + \frac{1}{2} \Big) A^{-1} b.
\end{aligned}$$

The first term tends to zero and the second tends to $A^{-1} b$.
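
The convergence of the recursion $x^{n+1} = \frac{1}{2} x^n + \frac{1}{2} A^{-1} b$ can be observed numerically. In the sketch below the matrix $A$, the vector $b$ and the starting point are random illustrative choices; $A$ is built to be symmetric positive definite so that $A^{-1} b$ exists.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M @ M.T + 3.0 * np.eye(3)       # symmetric positive definite (illustrative)
b = rng.normal(size=3)

x_star = np.linalg.solve(A, b)      # expected limit A^{-1} b
x = rng.normal(size=3)              # starting point x^0

errors = []
for _ in range(40):
    x = 0.5 * x + 0.5 * x_star      # x^{n+1} = x^n / 2 + A^{-1} b / 2
    errors.append(np.linalg.norm(x - x_star))

print(errors[:5])    # each error is half of the previous one
print(errors[-1])    # close to zero after 40 steps
```

The distance to $A^{-1} b$ is halved at every step, which matches the factor $\frac{1}{2^{n+1}}$ multiplying $x^0$ in the formula above.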


Exercise 4.17.3 (a) The cost function can be written as

$$C(x) = \frac{1}{2} \|Ax - b\|^2 = \frac{1}{2} \langle Ax - b, Ax - b \rangle = \frac{1}{2} \langle A^T A x, x \rangle - \langle Ax, b \rangle + \frac{1}{2} \|b\|^2.$$

The gradient is $\nabla C(x) = A^T A x - A^T b$ and the Hessian is given by $H_C(x) = A^T A$. Since $\det A \neq 0$ and the matrix $A^T A$ is symmetric, $H_C$ has real nonzero eigenvalues, which are the squares of the singular values of $A$. Therefore, $H_C$ is positive definite.

(b) We have

$$x^{n+1} = x^n - \eta \nabla C(x^n) = x^n - \eta\big(A^T A x^n - A^T b\big) = (I - \eta A^T A) x^n + \eta A^T b.$$









Exercise 4.17.4 (a) Inductively, we have

$$\begin{aligned}
a_{n+1} &\leq \mu a_n + \eta\\
&\leq \mu(\mu a_{n-1} + \eta) + \eta\\
&= \mu^2 a_{n-1} + \eta(\mu + 1)\\
&\leq \mu^{n+1} a_0 + \eta(\mu^n + \mu^{n-1} + \cdots + \mu + 1)\\
&= \mu^{n+1} a_0 + \eta\, \frac{1 - \mu^{n+1}}{1 - \mu} < a_0 + \frac{\eta}{1 - \mu}.
\end{aligned}$$

(b) Taking the norm in Equation (4.4.17) yields

$$\|v^{n+1}\| \leq \mu \|v^n\| + \eta \|\nabla f(x^n)\|.$$

Consider the sequence $a_n = \|v^n\|$. Then $a_{n+1} \leq \mu a_n + \tilde{\eta}$, with $\tilde{\eta} = \eta M$. By part (a), the sequence $(a_n)_n$ is bounded.

Exercise 4.17.5 (a) It follows from Fubini's Theorem.

(b) Using part (a) applied to $|f|$ and $|g|$, we obtain

$$\begin{aligned}
\|f * g\|_1 &= \int |f * g|(x)\, dx = \int \Big| \int f(y) g(x - y)\, dy \Big|\, dx\\
&\leq \int\!\!\int |f(y)|\, |g(x - y)|\, dy\, dx = \int (|f| * |g|)(x)\, dx\\
&= \int |f|(x)\, dx \int |g|(x)\, dx = \|f\|_1 \|g\|_1.
\end{aligned}$$

(c) It follows from part (b) by taking $g = G_\sigma$ and using $\|G_\sigma\|_1 = \int_{\mathbb{R}} G_\sigma(x)\, dx = 1$. More generally, for any $1 \leq p \leq \infty$ we have $\|f_\sigma\|_p \leq \|f\|_p \|G_\sigma\|_1 = \|f\|_p$, i.e., if $f \in L^p$ then $f_\sigma \in L^p$. In particular, for $p = 2$, we obtain that by filtering a finite energy signal we also obtain a finite energy signal.

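Exercise 4.17.6 It is a straightforward computation as in the following:

$$(G_{\sigma_1} * G_{\sigma_2})(x) = \int G_{\sigma_1}(u)\, G_{\sigma_2}(x - u)\, du = \frac{1}{2\pi\sigma_1\sigma_2} \int e^{-\frac{1}{2}\left[ \left(\frac{u}{\sigma_1}\right)^2 + \left(\frac{x-u}{\sigma_2}\right)^2 \right]}\, du.$$

After completing the square, the exponent can be written as

$$\Big(\frac{u}{\sigma_1}\Big)^2 + \Big(\frac{x - u}{\sigma_2}\Big)^2 = \frac{\sigma_1^2 + \sigma_2^2}{\sigma_1^2 \sigma_2^2} \Big( u - \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\, x \Big)^2 + \frac{x^2}{\sigma_1^2 + \sigma_2^2}.$$

Changing the variable and evaluating the Gaussian integral, we have

$$\begin{aligned}
(G_{\sigma_1} * G_{\sigma_2})(x) &= \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{x^2}{2(\sigma_1^2 + \sigma_2^2)}} \int e^{-\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2\sigma_2^2} \left( u - \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} x \right)^2}\, du\\
&= \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{x^2}{2(\sigma_1^2 + \sigma_2^2)}} \int e^{-\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2\sigma_2^2} v^2}\, dv\\
&= \frac{1}{\sqrt{2\pi}\sqrt{\sigma_1^2 + \sigma_2^2}}\, e^{-\frac{x^2}{2(\sigma_1^2 + \sigma_2^2)}} = G_{\sqrt{\sigma_1^2 + \sigma_2^2}}(x).
\end{aligned}$$

There is a proof variant using Fourier transforms as in the following. If $\mathcal{F}(f)(\xi) = \int f(x) e^{-2\pi i x \xi}\, dx$ denotes the Fourier transform of $f$, using the properties

$$\begin{aligned}
\mathcal{F}(f * g)(\xi) &= \mathcal{F}(f)(\xi)\, \mathcal{F}(g)(\xi)\\
\mathcal{F}(e^{-ax^2})(\xi) &= \sqrt{\frac{\pi}{a}}\, e^{-\frac{(\pi\xi)^2}{a}},
\end{aligned}$$

we have

$$\begin{aligned}
\mathcal{F}(G_{\sigma_1} * G_{\sigma_2})(\xi) &= \mathcal{F}(G_{\sigma_1})(\xi)\, \mathcal{F}(G_{\sigma_2})(\xi) = e^{-2\pi^2\sigma_1^2\xi^2}\, e^{-2\pi^2\sigma_2^2\xi^2}\\
&= e^{-2\pi^2(\sigma_1^2 + \sigma_2^2)\xi^2} = \mathcal{F}(G_\sigma)(\xi),
\end{aligned}$$

with $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$. Applying the inverse transform, $\mathcal{F}^{-1}$, yields the desired result.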
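
The result of Exercise 4.17.6 can be confirmed by approximating the convolution integral with a Riemann sum. This is only a numerical sketch; the values of $\sigma_1$, $\sigma_2$, the integration grid and the evaluation points are arbitrary choices.

```python
import numpy as np

def gaussian(x, s):
    """Gaussian density G_s with mean 0 and standard deviation s."""
    return np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

s1, s2 = 0.8, 1.5
u = np.linspace(-30.0, 30.0, 60001)     # integration grid
du = u[1] - u[0]
xs = np.array([-2.0, 0.0, 0.7, 3.0])    # points where the convolution is evaluated

# (G_s1 * G_s2)(x) = integral of G_s1(u) G_s2(x - u) du, as a Riemann sum.
conv = np.array([np.sum(gaussian(u, s1) * gaussian(x - u, s2)) * du for x in xs])
expected = gaussian(xs, np.hypot(s1, s2))   # G with sigma = sqrt(s1^2 + s2^2)

print(conv)
print(expected)
```

The two printed arrays should coincide up to the quadrature error, which is negligible here because the integrand is smooth and decays rapidly.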

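Exercise 4.17.7 We consider the variational problem with constraint

$$L = \frac{1}{2}(\sigma_1^2 + \cdots + \sigma_n^2) - \lambda(\sigma_1 + \cdots + \sigma_n - s),$$

with $\lambda$ Lagrange multiplier. The minimum satisfies

$$\frac{\partial L}{\partial \sigma_j} = \sigma_j - \lambda = 0 \iff \sigma_j = \lambda, \quad \forall\, 1 \leq j \leq n.$$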


Chapter 5 

Exercise 5.10.1 (a) The mapping is (0,0) —>► 0, (0,1) —>► 0, (1,0) —> 1, 
(1,1) —> 0. The point (1,0) is separated from the other points by the line 
y — x — 1/2. Then the perceptron with the output y — H(x — y — 1/2) learns 
the previous assignment. We have w\ — 1,W2 — —1,5 = —1/2. 

(5) Similarly with (a). 



















Hints and Solutions 


647 


(c) The output of the 1-dimensional perceptron is $y = H(-x + 1/2)$. The
weight is $w = -1$ and the bias is $b = 1/2$. Since for a Boolean variable $x$ we
have $\neg x = 1 - x$, the output of the associated linear neuron is $y = wx + b$,
with weight and bias given by $w = -1$ and $b = 1$.

(d) We can visualize the learning of $x_1 \wedge x_2 \wedge x_3$ as a separation of the cube
corner $(1,1,1)$ from the other corners. This is done by the plane $x_1 + x_2 + x_3 = 5/2$.
The perceptron which learns $x_1 \wedge x_2 \wedge x_3$ has weights $w_1 = w_2 = w_3 = 1$,
bias $b = -5/2$, and the output $y = H(x_1 + x_2 + x_3 - 5/2)$. The perceptron
which learns $x_1 \vee x_2 \vee x_3$ has the output $y = H(x_1 + x_2 + x_3 - 1/2)$.
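
The weights and biases quoted in parts (a), (c) and (d) can be checked by enumerating the Boolean inputs. The following is a minimal sketch; the helper names `heaviside` and `perceptron` are illustrative, and the convention $H(s) = 1$ for $s \ge 0$ is an assumption (no input here lands exactly on a separating boundary, so the choice does not affect the tables).

```python
from itertools import product

def heaviside(s):
    # Assumed convention: H(s) = 1 for s >= 0 and 0 otherwise.
    return 1 if s >= 0 else 0

def perceptron(weights, bias):
    """Return the map x -> H(<w, x> + b) for Boolean input tuples x."""
    return lambda x: heaviside(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Part (a): only the point (1, 0) is mapped to 1.
part_a = perceptron([1, -1], -0.5)
# Part (c): negation of a single Boolean variable.
negation = perceptron([-1], 0.5)
# Part (d): conjunction and disjunction of three Boolean variables.
and3 = perceptron([1, 1, 1], -2.5)
or3 = perceptron([1, 1, 1], -0.5)

table_a = {x: part_a(x) for x in product([0, 1], repeat=2)}
print(table_a)
```

Enumerating the eight corners of the cube with `and3` and `or3` gives the same check for part (d).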

Exercise 5.10.2 Show that the sets A and B can be separated by two distinct
lines and then find a line in between with rational weights.

Exercise 5.10.3 (a) The matrix $A = \mathbb{E}[XX^T]$ is diagonal, with $A_{ij} = \sigma_i^2\delta_{ij}$.
The optimal weights are $w^* = A^{-1}b$, with $w_j^* = b_j/\sigma_j^2 = \mathbb{E}[ZX_j]/\sigma_j^2$.

(b) Since $Z$ and $X_j$ are independent, $\mathbb{E}[ZX_j] = \mathbb{E}[Z]\,\mathbb{E}[X_j] = 0$. Then the
cost function is $\xi(w) = c + w^T A w$, with $c = \mathbb{E}[Z^2]$. If $A$ is positive definite,
its minimum is reached for $w = 0$. A zero mean target, independent of the
input, is learned with zero weights; by this choice the input does not matter.

(c) Newton's iteration is

$$w^{(j+1)} = w^{(j)} - H^{-1}\nabla\xi(w^{(j)}) = w^{(j)} - 2A^{-1}(2Aw^{(j)} - 2b) = -3w^{(j)} + 4A^{-1}b.$$

Exercise 5.10.4 In linear regression we are provided with $N$ vectors in
$\mathbb{R}^n$, $x_j$, corresponding to $N$ numbers, $z_j$, $1 \le j \le N$. We look for $n + 1$
parameters, $w_0, w_1, \ldots, w_n$, such that the sum of squared errors

$$C(w) = \frac{1}{N}\sum_{j=1}^{N}\big(w_1 x_j^1 + \cdots + w_n x_j^n - w_0 - z_j\big)^2$$

is minimum. Let $X$ and $Z$ be the input and the target variables of a linear
neuron. They are provided in practice by $N$ measurements, $x_j = X(\omega_j)$ and
$z_j = Z(\omega_j)$, for some $N$ states of the world $\omega_j \in \Omega$, $1 \le j \le N$. Also, let
$X_0 = -1$.

The training set is given by $\mathcal{T} = \{(x_j, z_j) \mid 1 \le j \le N\}$. The cost function
can be written using the empirical evaluation of the expectation as

$$C(w) = \mathbb{E}[(X^T w - Z)^2] = \frac{1}{N}\sum_{j=1}^{N}(x_j^T w - z_j)^2.$$

Hence, both the linear regression and the linear neuron minimize the same
cost function.




Exercise 5.10.5 Assume $x_0 \in (0,1)$. Then

$$y = H(x_0) = 1,$$

since $x_0 > 0$.

Assume $x_0 \notin (0,1)$. Then


y 



H (xol[o,i](xo)) = 1 - 




(x) dd XQ (x ) 


Exercise 5.10.6 We have

$$\langle w_1, x_{i_0}\rangle = \langle w_0 + x_{i_0}, x_{i_0}\rangle = \langle w_0, x_{i_0}\rangle + \|x_{i_0}\|^2 = \langle w_0, x_{i_0}\rangle + 1 > 0,$$

since $|\langle w_0, x_{i_0}\rangle| < \|w_0\|\,\|x_{i_0}\| = 1$.

Since all points $P_j$ belong to a half-circle, let $w^*$ be the unit normal vector
to the diameter of the circle, which separates the points from the rest
of the circle. Denote by $\rho_m$ the angle between $w^*$ and $w_m$ and let
$\delta = \min\{\cos\angle(w^*, x_i);\ 1 \le i \le n\}$. Then

$$\langle w^*, w_m\rangle = \langle w^*, w_{m-1} + x_{i_{m-1}}\rangle = \langle w^*, w_{m-1}\rangle + \langle w^*, x_{i_{m-1}}\rangle \ge \langle w^*, w_{m-1}\rangle + \delta.$$

Iterating the inequality, we obtain $\langle w^*, w_m\rangle \ge \langle w^*, w_0\rangle + m\delta$. Then

$$\cos\rho_m = \langle w^*, w_m\rangle \ge \langle w^*, w_0\rangle + m\delta.$$

The inequality

$$1 \ge \cos\rho_m \ge \langle w^*, w_0\rangle + m\delta$$

is contradictory, since the right term increases unbounded as $m \to \infty$, while
the left side is bounded. Hence, the process should end after finitely many
steps. If $N$ is the maximum number of iterations, then $N\delta \approx 1$, which provides
the estimation $N \approx 1/\delta$.
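
The termination argument can be illustrated numerically. The sketch below is not the book's algorithm verbatim: it assumes the update $w \leftarrow w + x$ is applied whenever some point has $\langle w, x\rangle \le 0$, and the sample points (unit vectors at 20, 40, ..., 160 degrees) and the starting vector are hypothetical choices.

```python
import math

def perceptron_learning(points, w0, max_steps=10_000):
    """Repeat w <- w + x for a misclassified x until <w, x> > 0 for all points."""
    w = list(w0)
    for step in range(max_steps):
        wrong = next((x for x in points if w[0] * x[0] + w[1] * x[1] <= 0), None)
        if wrong is None:
            return w, step          # step = number of updates performed
        w = [w[0] + wrong[0], w[1] + wrong[1]]
    raise RuntimeError("no separating vector found within max_steps")

# Hypothetical data: unit vectors on the upper half-circle.
points = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
          for a in range(20, 161, 20)]
w_star = (0.0, 1.0)   # unit normal to the separating diameter
delta = min(w_star[0] * x[0] + w_star[1] * x[1] for x in points)

w, steps = perceptron_learning(points, w0=(0.0, -1.0))
print(steps, 1 / delta)
```

The printed pair lets one compare the number of updates actually performed with the rough estimate $1/\delta$; the two need not coincide, since the estimate is only an order of magnitude.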

Exercise 5.10.7 The perceptron learning algorithm is valid for the case of
points in a half-plane. The reason is that if a diameter separates the points
defined by the unitary vectors $x_i$, then the same diameter separates any
other points whose vectors have the same direction as the previous ones.
This follows from the equivalence $\langle w, x_i\rangle > 0 \iff \langle w, \lambda x_i\rangle > 0$ for all $\lambda > 0$.




















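Exercise 5.10.9 Consider the mean square error computed at step $k$, $C(w_k) = \frac{1}{2}(z_k - w_k^T x_k)^2$. The gradient is $\nabla_{w_k} C = -(z_k - w_k^T x_k)\,x_k$. The gradient descent
rule is $w_{k+1} = w_k - \eta\nabla_{w_k} C = w_k + \eta\, e_k x_k$, where $e_k = z_k - w_k^T x_k$. If we let $c = \|x_k\|$ and define $\alpha = \eta c^2$,
we obtain $w_{k+1} = w_k + \alpha\,\frac{e_k x_k}{\|x_k\|^2}$, which is the $\alpha$-LMS update rule.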
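
A short sketch of one such update follows, assuming the normalized form $w + \alpha\, e\, x/\|x\|^2$ with $e = z - \langle w, x\rangle$; the function name `alpha_lms_step` and the training pair are hypothetical. With this normalization a single step scales the error on the presented sample by $1 - \alpha$.

```python
def alpha_lms_step(w, x, z, alpha):
    """One alpha-LMS update: w + alpha * e * x / |x|^2, with e = z - <w, x>."""
    error = z - sum(wi * xi for wi, xi in zip(w, x))
    norm_sq = sum(xi * xi for xi in x)
    return [wi + alpha * error * xi / norm_sq for wi, xi in zip(w, x)]

def sample_error(w, x, z):
    return z - sum(wi * xi for wi, xi in zip(w, x))

w = [0.0, 0.0]
x, z = [3.0, 4.0], 10.0          # hypothetical training pair
w_new = alpha_lms_step(w, x, z, alpha=0.5)
print(sample_error(w, x, z), sample_error(w_new, x, z))
```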

Chapter 6 

Exercise 6.6.2 (a) Consider the unit cube with one vertex at the origin and
with three sides in the direction of the axes of $\mathbb{R}^3$. Since no single plane
can separate the cube vertices $(0,0,0)$ and $(1,1,1)$, it follows that a single
perceptron cannot learn the mapping. (b) Try a one-hidden layer network.

Exercise 6.6.3 

y — H ( — t lH(x\ + X 2 + 1/2) — 3/2) — H^H(xi + X 2 + 1/2) + 5/2) — 1. 

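Exercise 6.6.4 (a) By chain rule $\frac{\partial\sigma}{\partial w} = x\,\sigma'(wx + b)$ and $\frac{\partial\sigma}{\partial b} = \sigma'(wx + b)$.
The gradient is $\nabla C = \left(\frac{\partial C}{\partial w}, \frac{\partial C}{\partial b}\right)$, with

$$\frac{\partial C}{\partial w} = \big(\sigma(wx + b) - z\big)\,x\,\sigma'(wx + b), \qquad \frac{\partial C}{\partial b} = \big(\sigma(wx + b) - z\big)\,\sigma'(wx + b).$$

Using that $\sigma' \le 1/4$ and $\sigma < 1$, we get

$$\|\nabla C\| = \sqrt{\Big(\frac{\partial C}{\partial w}\Big)^2 + \Big(\frac{\partial C}{\partial b}\Big)^2} = |\sigma(wx + b) - z|\,\sigma'(wx + b)\sqrt{1 + x^2} \le \frac{1}{4}\sqrt{1 + x^2}\,(1 + |z|).$$

(b) For a step $\eta > 0$, we have

$$\begin{aligned}
w_{n+1} &= w_n - \eta\, x\,\sigma'(w_n x + b_n)\big(\sigma(w_n x + b_n) - z\big)\\
b_{n+1} &= b_n - \eta\,\sigma'(w_n x + b_n)\big(\sigma(w_n x + b_n) - z\big).
\end{aligned}$$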
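
The update rule of part (b) and the bound of part (a) can be tried on a single training pair. This is a sketch under stated assumptions: the cost is taken to be $C = \frac{1}{2}(\sigma(wx+b) - z)^2$ with the logistic sigmoid, and the values of `x`, `z`, `eta` and the number of steps are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def cost(w, b, x, z):
    # Assumed cost: half of the squared error for one training pair (x, z).
    return 0.5 * (sigmoid(w * x + b) - z) ** 2

def gradient(w, b, x, z):
    y = sigmoid(w * x + b)
    common = (y - z) * y * (1.0 - y)    # (sigma - z) * sigma'
    return common * x, common           # (dC/dw, dC/db)

def train(w, b, x, z, eta=0.5, steps=200):
    for _ in range(steps):
        dw, db = gradient(w, b, x, z)
        w, b = w - eta * dw, b - eta * db
    return w, b

x, z = 2.0, 0.8                          # hypothetical training pair
w, b = train(0.0, 0.0, x, z)
print(cost(0.0, 0.0, x, z), cost(w, b, x, z))
```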


Exercise 6.6.5 (a) Let $X^{(0)} = X$. Taking the norm in the sensitivity relation

$$dY = \phi'(s^{(2)})\,W^{(2)T}\,\phi'(s^{(1)})\,W^{(1)T}\,dX^{(0)},$$

we obtain

$$\|dY\| \le |\phi'(s^{(2)})|\,\|W^{(2)T}\|\,|\phi'(s^{(1)})|\,\|W^{(1)T}\|\,\|dX^{(0)}\| \le \|\phi'\|^2\,\|W^{(1)}\|\,\|W^{(2)}\|\,\|dX^{(0)}\|.$$


































It suffices to choose $\rho = \|\phi'\|^2\,\|W^{(1)}\|\,\|W^{(2)}\|$.

(b) If the input $X$ is noisy, then small variations $dX$ occur. For noise removal
purposes, these variations should have the least possible effect on the output
variation $dY$. This can be achieved in two ways: keeping the norms of the
weight matrices small or choosing an activation function with $\phi'$ small. The
weights regularization for noise removal can be added as a constraint, such
as $\|W^{(1)}\|^2 + \|W^{(2)}\|^2 \le 1$, to the cost function.

(c) We need to choose the activation function with the smallest derivative
norm $\|\phi'\|$, i.e., the activation function with the slowest saturation. Then use
that tangent hyperbolic saturates faster than logistic sigmoid.


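Exercise 6.6.6 (a) The signals and outcomes of the layers are given by $s^{(1)} = w_1 x - b_1$, $x^{(1)} = \sigma(s^{(1)}) = \sigma(w_1 x - b_1)$, $s^{(2)} = w_2\,\sigma(s^{(1)}) - b_2$, and $y = x^{(2)} = \sigma(s^{(2)})$. Using chain rule

$$\begin{aligned}
\delta^{(2)} &= \frac{\partial C}{\partial s^{(2)}} = \frac{1}{2}\,2(y - z)\,\frac{\partial y}{\partial s^{(2)}} = (y - z)\,\sigma'(s^{(2)})\\
\delta^{(1)} &= \frac{\partial C}{\partial s^{(1)}} = \frac{\partial C}{\partial s^{(2)}}\,\frac{\partial s^{(2)}}{\partial s^{(1)}} = \delta^{(2)}\,w_2\,\sigma'(s^{(1)}).
\end{aligned}$$

(b) The components of the gradient, $\nabla C$, are given by

$$\begin{aligned}
\frac{\partial C}{\partial w_2} &= \frac{\partial C}{\partial s^{(2)}}\,\frac{\partial s^{(2)}}{\partial w_2} = \delta^{(2)} x^{(1)}, &
\frac{\partial C}{\partial w_1} &= \frac{\partial C}{\partial s^{(1)}}\,\frac{\partial s^{(1)}}{\partial w_1} = \delta^{(1)} x,\\
\frac{\partial C}{\partial b_2} &= \frac{\partial C}{\partial s^{(2)}}\,\frac{\partial s^{(2)}}{\partial b_2} = -\delta^{(2)}, &
\frac{\partial C}{\partial b_1} &= \frac{\partial C}{\partial s^{(1)}}\,\frac{\partial s^{(1)}}{\partial b_1} = -\delta^{(1)}.
\end{aligned}$$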
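
The delta formulas can be compared against finite differences. The sketch below assumes the cost $C = \frac{1}{2}(y - z)^2$, the logistic sigmoid, and the sign convention $s = wx - b$ used in this solution; all function names and the numerical values are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(params, x):
    w1, b1, w2, b2 = params
    s1 = w1 * x - b1
    x1 = sigmoid(s1)
    s2 = w2 * x1 - b2
    return s1, x1, s2, sigmoid(s2)

def cost(params, x, z):
    return 0.5 * (forward(params, x)[3] - z) ** 2

def backprop(params, x, z):
    """Gradient in the order (w1, b1, w2, b2), computed from the deltas."""
    w1, b1, w2, b2 = params
    s1, x1, s2, y = forward(params, x)
    delta2 = (y - z) * y * (1 - y)            # (y - z) * sigma'(s2)
    delta1 = delta2 * w2 * x1 * (1 - x1)      # delta2 * w2 * sigma'(s1)
    return [delta1 * x, -delta1, delta2 * x1, -delta2]

def numerical_gradient(params, x, z, h=1e-6):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grads.append((cost(up, x, z) - cost(down, x, z)) / (2 * h))
    return grads

params, x, z = [0.7, -0.2, 1.3, 0.4], 1.5, 0.25   # hypothetical values
print(backprop(params, x, z))
print(numerical_gradient(params, x, z))
```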


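Exercise 6.6.7 (a) We shall use backwards induction. Note that $W_{ij}^{(L)}$ and
$\delta_j^{(L)}$ are independent. Then assume that $W_{ij}^{(k)}$ and $\delta_j^{(k)}$ are independent for
any $\ell + 1 \le k \le L$. We need to show that $W_{ij}^{(\ell)}$ and $\delta_j^{(\ell)}$ are independent.
Since $\phi' = 1$, we can express the deltas of the $\ell$th layer in terms of the
deltas in the $(\ell + 1)$th layer as

$$\delta_j^{(\ell)} = \sum_k W_{jk}^{(\ell+1)}\,\delta_k^{(\ell+1)}.$$

Now, $W_{ij}^{(\ell)}$ is independent of $\delta_k^{(\ell+1)}$ by the induction hypothesis and independent
of $W_{jk}^{(\ell+1)}$ as weights in distinct layers. Therefore, $W_{ij}^{(\ell)}$ is independent
of the combination $\sum_k W_{jk}^{(\ell+1)}\delta_k^{(\ell+1)}$ and hence independent of $\delta_j^{(\ell)}$.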

(b) By induction, similar to part (a).

Exercise 6.6.8 Using the approximate formula $Var\,f(X) \approx f'(\mathbb{E}[X])^2\,Var(X)$
for $f(x) = \sigma(wx + b)$, we obtain

$$Var(Y) \approx \sigma'(w\mathbb{E}[X] + b)^2\,Var(wX + b) = \sigma'(b)^2 w^2\,Var(X) = \sigma'(b)^2 w^2.$$

When the bias takes a large positive or negative value, the sigmoid saturates
and its derivative becomes small; consequently $Var(Y)$ decreases.

Exercise 6.6.9 Using an idea similar to the one used in the previous problem, we have

$$Var(Y) = \sum_i \alpha_i^2\,Var\,\sigma(w_i X + b_i) \approx \sum_i \alpha_i^2\,\sigma'(b_i)^2\,w_i^2\,Var(X).$$


Exercise 6.6.10 Since $p(x) = 1/(b - a)$, then

$$H(p) = -\int_a^b p(x)\ln p(x)\,dx = -\int_a^b \frac{1}{b-a}\,\ln\frac{1}{b-a}\,dx = \ln(b - a).$$


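Exercise 6.6.11 Let $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Using relations $\int_{\mathbb{R}} p(x)\,dx = 1$ and
$\int_{\mathbb{R}}(x - \mu)^2 p(x)\,dx = \sigma^2$, we have

$$\begin{aligned}
H(p) &= -\int_{\mathbb{R}} p(x)\ln p(x)\,dx\\
&= \frac{1}{2}\ln(2\pi)\int_{\mathbb{R}} p(x)\,dx + \ln\sigma\int_{\mathbb{R}} p(x)\,dx + \frac{1}{2\sigma^2}\int_{\mathbb{R}}(x - \mu)^2 p(x)\,dx\\
&= \frac{1}{2}\ln(2\pi) + \ln\sigma + \frac{1}{2} = \ln(\sigma\sqrt{2\pi e}).
\end{aligned}$$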
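
Both closed forms, $\ln(b - a)$ for the uniform density and $\ln(\sigma\sqrt{2\pi e})$ for the normal density, can be checked by numerical integration. The following is a minimal sketch using a midpoint rule; the helper names, the parameter values, and the truncation of the real line to ten standard deviations are assumptions made for illustration.

```python
import math

def entropy(density, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of -p ln p over [a, b]."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        p = density(a + (i + 0.5) * h)
        if p > 0.0:
            total -= p * math.log(p) * h
    return total

def uniform(a, b):
    return lambda x: 1.0 / (b - a)

def normal(mu, sigma):
    return lambda x: (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                      / (sigma * math.sqrt(2 * math.pi)))

# Hypothetical parameters: uniform on [1, 4]; normal with mu = 0.5, sigma = 2.
h_uniform = entropy(uniform(1.0, 4.0), 1.0, 4.0)
h_normal = entropy(normal(0.5, 2.0), 0.5 - 20.0, 0.5 + 20.0)
print(h_uniform, math.log(3.0))
print(h_normal, math.log(2.0 * math.sqrt(2 * math.pi * math.e)))
```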


Chapter 7 

Exercise 7.11.1 It follows from $|\tanh x| < 1$.

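Exercise 7.11.2 (a) Let $f \in \mathcal{F}$, with $f(x) = ax + b$. Then modulus inequalities
yield $|f(x)| \le |a|\,|x| + |b| \le |a| + |b| \le 1$, for all $x \in [0,1]$. The family $\mathcal{F}$ is
uniformly bounded with $M = 1$.

Fix $\epsilon > 0$ and choose $\eta = \epsilon/|a|$. Then for any $x, x' \in [0,1]$, with $|x - x'| < \eta$
we have $|f(x) - f(x')| = |ax + b - ax' - b| = |a|\,|x - x'| < \epsilon$. Since $|a| \le 1$, the choice
$\eta = \epsilon$ works for every member of the family. Hence, the family
$\mathcal{F}$ is equicontinuous.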

























(b) Applying the Arzela-Ascoli Theorem, we obtain: Among all possible outcomes
of a linear neuron (with 1-dimensional input and output), $y = ax + b$,
which satisfy the regularization constraints, $|a| + |b| \le 1$, there is a sequence
$y_n = a_n x + b_n$ which converges uniformly to an affine function, $f(x) = ax + b$.
In short, the linear neuron has the capability of learning some affine function.

Exercise 7.11.3 By the Fundamental Theorem of Calculus, we have $f(x_2) - f(x_1) = \int_{x_1}^{x_2} f'(x)\,dx$. Then Cauchy's inequality provides

$$|f(x_2) - f(x_1)|^2 = \Big|\int_{x_1}^{x_2} f'(x)\,dx\Big|^2 \le |x_2 - x_1|\int_{x_1}^{x_2} f'(x)^2\,dx \le M\,|x_2 - x_1|.$$

Then for any given $\epsilon > 0$, choose $\eta = \epsilon^2/M$ and apply the definition of
equicontinuity.

Exercise 7.11.4 Fix $\epsilon = 1$. Then for any $x_0 \in D$, there is $\eta > 0$ such that for
$x \in (x_0 - \eta, x_0 + \eta)$ we have

$$|f(x)| \le |f(x) - f(x_0)| + |f(x_0)| \le 1 + M, \qquad \forall f \in \mathcal{F}.$$

Note that $\bigcup_{x_0 \in D}(x_0 - \eta, x_0 + \eta) = (a, b)$, so any $x \in (a, b)$ belongs to a
neighborhood of the type $(x_0 - \eta, x_0 + \eta)$.


Exercise 7.11.5 Let $f \in \mathcal{F}$. Fix $\epsilon > 0$. Then by equicontinuity of $\mathcal{F}$, there is
$\eta_j > 0$ such that $|f_j(x) - f_j(x')| < \epsilon/k$ for $|x - x'| < \eta_j$, $1 \le j \le k$. Choose
$\eta = \min_{1 \le j \le k}\eta_j$. We have

$$|f(x) - f(x')| \le \Big|\sum_{j=1}^{k}\big(f_j(x) - f_j(x')\big)\Big| \le \sum_{j=1}^{k}|f_j(x) - f_j(x')| < \epsilon,$$

for $|x - x'| < \eta$.


Exercise 7.11.6 It follows from Dini's Theorem applied to the decreasing
positive sequence of functions $f_n = |f(x) - G_n(x)|$.


Exercise 7.11.7 Periodic functions with period $T$ can be considered as functions
on the circle of radius $R = 2\pi/T$, which is a compact set, denoted here
by $K$. Therefore, $f \in C(K)$. The set of all functions $\mathcal{T}$, which can be written
as trigonometric sums, forms a subalgebra of $C(K)$, which separates points
and contains constants. An application of the Stone-Weierstrass Theorem
produces the desired result.


Exercise 7.11.8 The main ingredient in the proof of Proposition 7.7.2 is that
$\|\sigma'\| \le 1/4$. Given that most sigmoid functions are symmetric with respect to



















the origin, we have $\|\sigma'\| = \sigma'(0)$. Therefore, the result holds for all increasing
differentiable sigmoid functions satisfying $\sigma'(0) < \lambda < 1$.


Exercise 7.11.9 By Arzela-Ascoli Theorem, we may assume, eventually passing
to a subsequence, that $f_j$ converges uniformly to $f$ on $[a, b]$. Then $f_j^2$
converges uniformly to $f^2$ on $[a, b]$. Then $\int_a^b f_j^2(x)\,dx \to \int_a^b f^2(x)\,dx$. Using the
hypothesis, it follows that $\int_a^b f^2(x)\,dx = 0$. Since $f^2 \ge 0$, it follows that $f = 0$.
That is, $\lim_{j\to\infty} f_j = 0$, uniformly. We also note that the proof could have been
done by the contradiction method, without using Arzela-Ascoli Theorem.


Exercise 7.11.10 (a) Let $\epsilon > 0$ be arbitrary fixed. Using the uniform continuity
of the function $K(s,t)$ (continuous function on a compact set), for any
$\rho > 0$, there is $\eta_\rho > 0$ such that if $|s' - s| < \eta_\rho$ then $|K(s',t) - K(s,t)| < \rho$.
Since $\rho$ is arbitrary, we choose it such that $\rho < \epsilon/\sqrt{M(d-c)}$.

For any $g \in \mathcal{F}_M$, applying Cauchy's inequality, we have

$$\begin{aligned}
|g(s') - g(s)| &= \Big|\int_c^d \big(K(s',t) - K(s,t)\big)h(t)\,dt\Big|\\
&\le \Big(\int_c^d |K(s',t) - K(s,t)|^2\,dt\Big)^{1/2}\Big(\int_c^d h(t)^2\,dt\Big)^{1/2}\\
&\le \rho\sqrt{d - c}\,\sqrt{M} < \epsilon.
\end{aligned}$$

Hence, we have $|g(s') - g(s)| < \epsilon$, whenever $|s - s'| < \eta$, with $\epsilon$ and $\eta$
independent of $g$.

(b) Using Cauchy's inequality, for any $g \in \mathcal{F}_M$, we have

$$|g(s)| = \Big|\int_c^d K(s,t)h(t)\,dt\Big| \le \int_c^d |K(s,t)|\,|h(t)|\,dt \le \Big(\int_c^d K^2(s,t)\,dt\Big)^{1/2} M^{1/2}.$$

Being continuous on the compact $[a, b]$, the function $s \mapsto \int_c^d K^2(s,t)\,dt$ is
bounded. It follows that all functions $g$ have a common upper bound.


Exercise 7.11.11 (a) It's just a verification of the algebra axioms. (b) It follows
from the Stone-Weierstrass Theorem of approximation. (c) Any continuous
function on $[a, b]$ can be learned uniformly by a sequence of polynomials
in $e^x$.

















Chapter 8 

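Exercise 8.8.1 (a) For any $\phi \in C_0^\infty$ we have

$$-\int_{\mathbb{R}} H(x - x_0)\,\phi'(x)\,dx = -\int_{x_0}^{\infty}\phi'(x)\,dx = -\big(\phi(\infty) - \phi(x_0)\big) = \phi(x_0) = \int_{\mathbb{R}}\phi(x)\,\delta(x - x_0)\,dx.$$

(b) For any $\phi \in C_0^\infty$

$$-\int_{\mathbb{R}} ReLU(x)\,\phi'(x)\,dx = -\int_0^\infty x\,\phi'(x)\,dx = \int_0^\infty \phi(x)\,dx = \int_{\mathbb{R}} H(x)\,\phi(x)\,dx.$$

The others are similar.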

Exercise 8.8.2 Shifting indices and parsing the sums, we have

$$\begin{aligned}
\sum_{i=0}^{N-1}\alpha_i 1_{[x_i, x_{i+1})}(x) &= \sum_{i=0}^{N-1}\alpha_i\big[H(x - x_i) - H(x - x_{i+1})\big]\\
&= \sum_{i=0}^{N-1}\alpha_i H(x - x_i) - \sum_{j=1}^{N}\alpha_{j-1}H(x - x_j)\\
&= \alpha_0 H(x - x_0) + \sum_{i=1}^{N-1}\alpha_i H(x - x_i) - \sum_{i=1}^{N-1}\alpha_{i-1}H(x - x_i) - \alpha_{N-1}H(x - x_N)\\
&= \alpha_0 H(x - x_0) + \sum_{i=1}^{N-1}(\alpha_i - \alpha_{i-1})H(x - x_i) - \alpha_{N-1}H(x - x_N).
\end{aligned}$$

Equating against the sum $\sum_{i=0}^{N} c_i H(x - x_i)$, we obtain the following coefficients:
$c_0 = \alpha_0$, $c_i = \alpha_i - \alpha_{i-1}$, and $c_N = -\alpha_{N-1}$.
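
The coefficient formulas can be checked on a small example. In the sketch below the values of `alphas` and `knots` are illustrative choices, the function names are hypothetical, and the convention $H(0) = 1$ is assumed so that the step sum matches the half-open intervals $[x_i, x_{i+1})$.

```python
def heaviside(s):
    # Assumed convention H(0) = 1, matching the half-open intervals [x_i, x_{i+1}).
    return 1.0 if s >= 0 else 0.0

def step_coefficients(alphas):
    """c_0 = a_0, c_i = a_i - a_{i-1} for 1 <= i <= N-1, and c_N = -a_{N-1}."""
    n = len(alphas)
    middle = [alphas[i] - alphas[i - 1] for i in range(1, n)]
    return [alphas[0]] + middle + [-alphas[-1]]

def simple_function(alphas, knots, x):
    """Sum of alpha_i times the indicator of [x_i, x_{i+1})."""
    return sum(a for a, lo, hi in zip(alphas, knots, knots[1:]) if lo <= x < hi)

def step_sum(coeffs, knots, x):
    return sum(c * heaviside(x - k) for c, k in zip(coeffs, knots))

alphas = [1.0, 3.0, 2.0, 1.0]          # illustrative heights
knots = [0.0, 2.0, 4.0, 6.0, 8.0]      # illustrative partition points
coeffs = step_coefficients(alphas)
print(coeffs)  # [1.0, 2.0, -1.0, -1.0, -1.0]
```

Evaluating `simple_function` and `step_sum` on a grid that includes the knots confirms that the two representations agree pointwise.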

Exercise 8.8.4 Consider a decreasing sequence of positive numbers, $\epsilon_n \searrow 0$,
and consider the function $f_n(x) = |g(x) - g_{\epsilon_n}(x)|$. Given the construction
of $g_{\epsilon_n}(x)$, we have $0 \le f_{n+1}(x) \le f_n(x)$ for any $x$. By Dini's Theorem,
the sequence of functions $f_n$ converges uniformly to 0, i.e., $g_{\epsilon_n}$ converges,
uniformly, to $g(x)$.






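Exercise 8.8.6 Changing variables and using L'Hospital's rule, we have

$$\begin{aligned}
\lim_{\alpha\searrow 0}\varphi_\alpha(x) &= \lim_{\alpha\searrow 0}\alpha\ln(1 + e^{x/\alpha}) = \lim_{\alpha\searrow 0}\frac{\ln(1 + e^{x/\alpha})}{1/\alpha} = \lim_{t\nearrow\infty}\frac{\ln(1 + e^{tx})}{t}\\
&= \lim_{t\nearrow\infty}\frac{x\,e^{tx}}{1 + e^{tx}} = \lim_{t\nearrow\infty}\frac{x}{1 + e^{-tx}} = \begin{cases} x, & \text{if } x > 0\\ 0, & \text{if } x < 0,\end{cases}
\end{aligned}$$

which leads to the first part of the result. If we let $t = 1/\alpha > 0$ then

$$\varphi_\alpha(x) = \alpha\ln(1 + e^{x/\alpha}) = \frac{\ln(1 + e^{xt})}{t},$$

which, for $\alpha$ small ($t$ large), behaves as $\frac{x}{1 + e^{-xt}}$, which is increasing in $t$ for
$x > 0$.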
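
The convergence of $\varphi_\alpha$ to the ReLU function as $\alpha \searrow 0$ can be observed numerically. The sketch below rewrites $\alpha\ln(1 + e^{x/\alpha})$ in the algebraically equivalent form $\max(x, 0) + \alpha\ln(1 + e^{-|x|/\alpha})$ to avoid overflow for small $\alpha$; the grid on $[-5, 5]$ and the function names are illustrative assumptions.

```python
import math

def softplus(x, alpha):
    """alpha * ln(1 + exp(x / alpha)), written in an overflow-safe form."""
    return max(x, 0.0) + alpha * math.log1p(math.exp(-abs(x) / alpha))

def relu(x):
    return max(x, 0.0)

grid = [k / 10 for k in range(-50, 51)]          # points in [-5, 5]
gaps = {}
for alpha in (1.0, 0.1, 0.01):
    gaps[alpha] = max(softplus(x, alpha) - relu(x) for x in grid)
    print(alpha, gaps[alpha])
```

On this grid the largest gap occurs at $x = 0$, where it equals $\alpha\ln 2$, so the uniform distance to ReLU shrinks linearly in $\alpha$.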


Chapter 9 

Exercise 9.8.1 In this case the target space is the metric space $(\mathbb{R}, d)$, with
the distance function $d(x,y) = |x - y|$. The approximation space is $\mathbb{Q}$. Any
real number can be learned by rational numbers. Actually, one of the constructions
of the real number field is based on this property.

Exercise 9.8.2 The target space is the space of continuous functions $C[a, b]$,
endowed with the distance $d(f,g) = \sup_{[a,b]}|f(x) - g(x)|$. The approximation
space is the set of all polynomials $\mathcal{P}[a,b]$. Any continuous function can be
learned by polynomial functions on a compact interval.

Exercise 9.8.3 Let $\mathcal{T} = \{\alpha x_1 + \lambda x_0;\ \alpha, \lambda \in \mathbb{R}\}$ be the space generated by the
noncollinear vectors $x_0$ and $x_1$. Define the functional $L : \mathcal{T} \to \mathbb{R}$, $L(t) = \lambda$,
where $t = \alpha x_1 + \lambda x_0$. It is obvious that $L$ is linear. By a procedure similar
to the one used in the proof of Lemma 9.3.1 we can show that $L$ is also
bounded, with the norm $\|L\| \le 1/\delta$, where $\delta$ is the distance between $x_0$ and
the support line of $x_1$. By Hahn-Banach Theorem the functional $L$ can be
extended to a linear bounded functional on $\mathcal{X}$ keeping the same bound. We
can easily verify that $L(x_0) = 1$ and $L(x_1) = 0$.

Exercise 9.8.4 Applying the construction used in Exercise 9.8.3 symmetrically
for $x_0$ and then for $x_1$, we obtain two linear bounded functionals $L_1$
and $L_2$ on $\mathcal{X}$ such that

$$L_1(x_0) = 0, \quad L_1(x_1) = 1, \quad L_2(x_0) = 1, \quad L_2(x_1) = 0.$$











Consider the average functional $L = \frac{1}{2}(L_1 + L_2)$, which is linear and bounded.
We have $L(x_0) = L(x_1) = 1/2$. Moreover, its norm satisfies

$$\|L\| = \frac{1}{2}\|L_1 + L_2\| \le \frac{1}{2}\big(\|L_1\| + \|L_2\|\big) \le \frac{1}{2}\Big(\frac{1}{\delta_0} + \frac{1}{\delta_1}\Big) = \frac{\delta_0 + \delta_1}{2\delta_0\delta_1}.$$

By a similar procedure we can prove the following more general statement:
given the linearly independent set of vectors $\{x_0, x_1, \ldots, x_N\}$ in $\mathcal{X}$, there is a
bounded linear functional $L$ on $\mathcal{X}$, such that $L(x_0) = L(x_1) = \cdots = L(x_N)$.


Exercise 9.8.5 Consider in Exercise 9.8.4 the space $\mathcal{X} = C[a,b]$ and the
independent vectors $x_0 = \sin t$, $x_1 = \cos t$. Then there is a bounded linear
functional $L : C[a, b] \to \mathbb{R}$ such that $L(\sin t) = L(\cos t)$. By the representation
theorem, Theorem E.5.6, there exists a unique finite signed Borel measure $\mu$ on
$[a, b]$, such that

$$L(f) = \int_a^b f(t)\,d\mu(t), \qquad \forall f \in C[a, b].$$

Therefore, the identity $L(\sin t) = L(\cos t)$ becomes

$$\int_a^b \sin t\,d\mu(t) = \int_a^b \cos t\,d\mu(t).$$

Moreover, $\|L\| = |\mu|([a,b])$, and an upper bound for $\|L\|$ is provided in
Exercise 9.8.4.


Exercise 9.8.6 (a) Note that $L(P) = P(1)$, so $L$ is a linear functional. Since

$$|L(P)| = |P(1)| \le \sup_{[0,1]}|P(x)| = \|P\|_\infty,$$

it follows that $\|L\| \le 1$.

(b) The functional $L$ can be extended linearly to $C[0,1]$ by Hahn-Banach
Theorem, keeping the bound $\|L\| \le 1$. Since now $L$ is a bounded linear functional
on $C[0,1]$, the representation theorem, Theorem E.5.6, provides the
measure $\mu$ such that $L(f) = \int_0^1 f(x)\,d\mu(x)$, for all $f \in C[0,1]$. In particular,
for $f = P$ we obtain $L(P) = \int_0^1 P(x)\,d\mu(x)$, or equivalently,

$$\int_0^1 P(x)\,d\mu(x) = a_0 + a_1 + \cdots + a_n, \qquad \forall P \in \mathcal{P}([0,1]).$$


Exercise 9.8.7 (b) The function $\phi(x) = e^{-x^2} \in L^1(\mathbb{R})$ and $\int_{\mathbb{R}} e^{-x^2}\,dx \ne 0$.
By point 2. of Remark 9.3.19 the function is discriminatory in $L^1$-sense.

Exercise 9.8.8 (a) The output is

$$y = \sum_{i=1}^{N_2}\alpha_i\,\sigma\Big(\sum_{j=1}^{N_1} w_{ji}\,\sigma(\lambda_j^T x + b_j) + \beta_i\Big), \qquad (1)$$




































with the following notations: $b_j$ are the biases of neurons in the first hidden
layer, $\beta_i$ denote the biases of the neurons in the second hidden layer, $\lambda_j$ are
the weights from the input to the first layer, $w_{ji}$ are the weights between the
hidden layers, and $\alpha_i$ are the weights between the second hidden layer and
the output.

(b) Adding, subtracting, and multiplying by scalars expressions of type (1),
we obtain something of the same type, with different parameters, eventually
some equal to zero.

Exercise 9.8.9 (a) Take the continuous function to be affine, $f(x) = w^T x + b$.
(b) Let $\mathcal{U}$ denote the set of continuous functions of type (1), i.e., the set of
all outputs of two-hidden layer FNNs. We show that $\mathcal{U}$ is dense in $C(I_n)$. By
contradiction, if $\mathcal{U}$ is not dense, by Lemma 9.3.2 there is a nonzero bounded
linear functional $L$ on $C(I_n)$ such that $L|_{\mathcal{U}} = 0$. By the representation theorem,
Theorem E.5.6, there is a signed measure $\mu$ on $I_n$ such that

$$\int_{I_n}\sum_{i=1}^{N_2}\alpha_i\,\sigma\Big(\sum_{j=1}^{N_1} w_{ji}\,\sigma(\lambda_j^T x + b_j) + \beta_i\Big)\,d\mu(x) = 0,$$

for all values of the parameters. In particular,

$$\int_{I_n}\sigma\Big(\sum_{j=1}^{N_1} w_{ji}\,\sigma(\lambda_j^T x + b_j) + \beta_i\Big)\,d\mu(x) = 0.$$

Since $f(x) = \sum_{j=1}^{N_1} w_{ji}\,\sigma(\lambda_j^T x + b_j) + \beta_i$ is continuous and $\sigma$ is strong discriminatory,
it follows that $\mu = 0$, a fact that contradicts $\|L\| = |\mu|(I_n)$.


Chapter 10 

Exercise 10.11.1 (a) The network output is of the form

$$G(x) = \sum_{i=1}^{2}\alpha_i H(w_i^T x + b_i) = \sum_{i=1}^{2}\alpha_i H(w_{1i}x_1 + w_{2i}x_2 + b_i).$$

We make the simplifying assumptions $w_{11} = w_{12} = w_1$ and $w_{21} = w_{22} = w_2$.
Then we are left with 6 parameters, which satisfy 4 equations:

$$\begin{aligned}
0 &= G(0,0) = \alpha_1 H(b_1) + \alpha_2 H(b_2)\\
1 &= G(0,1) = \alpha_1 H(w_2 + b_1) + \alpha_2 H(w_2 + b_2)\\
1 &= G(1,0) = \alpha_1 H(w_1 + b_1) + \alpha_2 H(w_1 + b_2)\\
0 &= G(1,1) = \alpha_1 H(w_1 + w_2 + b_1) + \alpha_2 H(w_1 + w_2 + b_2).
\end{aligned}$$









One possible solution is $\alpha_1 = 1$, $\alpha_2 = -1$, $w_1 = -0.5$, $w_2 = -2$, $b_1 = 2.25$,
$b_2 = 0.25$. Hence, an output which learns the XOR function is

$$G(x_1, x_2) = H(-0.5x_1 - 2x_2 + 2.25) - H(-0.5x_1 - 2x_2 + 0.25).$$
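
The proposed output can be checked against the XOR truth table by direct enumeration. This is a minimal sketch; the convention $H(s) = 1$ for $s \ge 0$ is an assumption, and none of the four inputs produces an argument equal to zero, so the check does not depend on it.

```python
def heaviside(s):
    # Assumed convention: H(s) = 1 for s >= 0 and 0 otherwise.
    return 1 if s >= 0 else 0

def G(x1, x2):
    """Difference of two step units with the parameters quoted in the solution."""
    return (heaviside(-0.5 * x1 - 2 * x2 + 2.25)
            - heaviside(-0.5 * x1 - 2 * x2 + 0.25))

truth_table = {(x1, x2): G(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(truth_table)  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```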


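Exercise 10.11.3 Obviously, $\psi(x) = e^{-x^2} \in L^1(\mathbb{R})$, with the Fourier transform
$\Psi(\xi) = \hat{\psi}(\xi) = \sqrt{\pi}\,e^{-\xi^2/4}$, so $\Psi(1) = \sqrt{\pi}/e^{1/4}$. Then Irie-Miyake's
formula becomes

$$f(x) = \frac{e^{1/4}}{(2\pi)^n\sqrt{\pi}}\int_{\mathbb{R}^{n+1}} e^{\,i w_0 - (x^T w - w_0)^2}\,F(w)\,dw\,dw_0.$$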


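Exercise 10.11.4 Consider the function

$$f(x) = \begin{cases} e^{-\frac{1}{x - 1/2}}, & \text{if } \frac{1}{2} < x \le 1\\ 0, & \text{if } 0 \le x \le \frac{1}{2}.\end{cases}$$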

It can be shown that $f$ is continuous on $[0,1]$ but it is not analytic. (If $f$ were
analytic on $(0,1)$, then, since $f|_{(0,1/2)} = 0$, by the identity theorem
of analytic functions we would get $f = 0$ on $(0,1)$, a contradiction.) Since the logistic
sigmoid is analytic, the outcome of the network is also analytic, while
$f$ is not.

Exercise 10.11.5 Use that condition (ii) becomes the convergence of the
series $\sum_{i \ge 1}\frac{g_i^2}{\lambda_i^{2(n+1)}}$.

Exercise 10.11.6 Integrating in relation $K^{(n)}(t,t) = \sum_{i \ge 1}\lambda_i^n e_i^2(t)$ yields

$$\int K^{(n)}(t,t)\,dt = \sum_{i \ge 1}\lambda_i^n\int e_i^2(t)\,dt = \sum_{i \ge 1}\lambda_i^n,$$

and use that the integral on the left side is finite.

Exercise 10.11.7 (a) Consider intervals of length 2 centered at 1, 3, 5, and 7.
Then choose the simple function

$$c(x) = 1_{[0,2)}(x) + 3\cdot 1_{[2,4)}(x) + 2\cdot 1_{[4,6)}(x) + 1_{[6,8)}(x),$$

which learns the data. (b) Since each indicator function can be written as a
difference of step functions, for instance, $1_{[2,4)}(x) = H(x - 2) - H(x - 4)$, we
obtain

$$c(x) = H(x - 0) + 2H(x - 2) - H(x - 4) - H(x - 6) - H(x - 8),$$

and choose this function as $G(x)$.







Exercise 10.11.8 It follows from the composition associativity property. 


Chapter 11 


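Exercise 11.10.1 (a) Since

$$\{\omega;\ c \le t\} = \begin{cases}\emptyset, & \text{if } c > t\\ \Omega, & \text{if } c \le t,\end{cases}$$

the sigma-field generated by $c$ is the sigma-field generated by the sets $\emptyset$ and $\Omega$, which is the
trivial algebra $\{\emptyset, \Omega\}$.

(b) It is shown by double inclusion as follows: Since

$$\{\omega;\ c + X(\omega) \le t\} = \{\omega;\ X(\omega) \le t - c\} \in \mathfrak{S}(X)$$

then $\mathfrak{S}(c + X) \subset \mathfrak{S}(X)$. Similarly, since

$$\{\omega;\ X(\omega) \le t\} = \{\omega;\ c + X(\omega) \le t + c\} \in \mathfrak{S}(c + X),$$

it follows that $\mathfrak{S}(X) \subset \mathfrak{S}(c + X)$.

(c) Assume $c > 0$. Use that $\{\omega;\ cX(\omega) \le t\} = \{\omega;\ X(\omega) \le t/c\} \in \mathfrak{S}(X)$ and
hence $\mathfrak{S}(cX) \subset \mathfrak{S}(X)$. The converse inclusion can be shown similarly.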

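Exercise 11.10.2 (a) Since the value of $Y_n$ is determined by $X_1, \ldots, X_n$, then
$Y_n = F(X_1, \ldots, X_n)$, and hence $\mathfrak{S}(Y_n) \subset \mathfrak{S}(X_1, \ldots, X_n)$. The strict inclusion
follows from part (b).

(b) Since $X_n = Y_n - Y_{n-1}$, then $\mathfrak{S}(X_n) \subset \mathfrak{S}(Y_{n-1}, Y_n)$. We have

$$\begin{aligned}
\mathfrak{S}(X_1, \ldots, X_n) &= \mathfrak{S}\big(\mathfrak{S}(X_1)\cup\cdots\cup\mathfrak{S}(X_n)\big)\\
&\subset \mathfrak{S}\big(\mathfrak{S}(Y_0, Y_1)\cup\cdots\cup\mathfrak{S}(Y_{n-1}, Y_n)\big)\\
&\subset \mathfrak{S}(Y_1, \ldots, Y_n),
\end{aligned}$$

which proves the converse inclusion.

(c) The formula for $Z_n$ can be shown inductively. The identity of sigma-fields
follows from part (b), which implies

$$\mathfrak{S}(Z_1, \ldots, Z_n) = \mathfrak{S}(Y_1, \ldots, Y_n) = \mathfrak{S}(X_1, \ldots, X_n).$$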


Exercise 11.10.3 We draw the Venn diagram with sets $A$ and $B$ intersecting.
We obtain 4 regions: $R_1 = A\setminus B$, $R_2 = A\cap B$, $R_3 = (A\cup B)^c$, and $R_4 = B\setminus A$.
The desired sigma-algebra is generated by the partition $\{R_1, R_2, R_3, R_4\}$, i.e.,
its elements are obtained by taking unions of subsets. There are $2^4$ elements.
Using de Morgan's relations we obtain the sets given in the exercise. For
instance, $R_1\cup R_4 = (A\setminus B)\cup(B\setminus A) = (A\cap B^c)\cup(A^c\cap B)$.
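
The count of $2^4$ elements can be reproduced by forming all unions of the four regions. In the sketch below the sample space and the sets `A` and `B` are hypothetical, chosen so that all four regions are nonempty, and the function name is illustrative.

```python
from itertools import combinations

def sigma_algebra_from_partition(blocks):
    """All unions of blocks of a finite partition (the empty union included)."""
    events = set()
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            events.add(frozenset().union(*chosen))
    return events

# Hypothetical sample space and sets with all four regions nonempty.
omega = frozenset(range(1, 7))
A, B = frozenset({1, 2, 3}), frozenset({3, 4})
regions = [A - B, A & B, omega - (A | B), B - A]   # R1, R2, R3, R4

events = sigma_algebra_from_partition(regions)
print(len(events))  # 16
```

The resulting family is closed under complements and unions, and it contains, for example, the symmetric difference $R_1\cup R_4$.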




Exercise 11.10.4 We draw the Venn diagram with sets $A$, $B$, and $C$ intersecting.
We obtain a partition of $\Omega$ into 8 distinct regions. The sigma-field will
have $2^8 = 256$ elements, too many to be written down explicitly.

Exercise 11.10.5 (a) We have

$$\mathcal{F}_{E_1,E_2} = \{\emptyset, E_1, E_2, E_1\cup E_2, E_1\cap E_2, (E_1\cup E_2)^c, (E_1\cap E_2)^c, \Omega\}.$$

It suffices to recover the sets $E_1$ and $E_2$. This is done by

$$E_1 = \Big(\bigcup_{j=2}^{N} E_j\Big)^c, \qquad E_2 = \Big(\bigcup_{j=1,\,j\ne 2}^{N} E_j\Big)^c.$$

(b) Since $\mathcal{F}_{E_1} = \{\emptyset, E_1, E_1^c, \Omega\}$, and $E_1$ can be recovered from the other sets
as $E_1 = \big(\bigcup_{i=2}^{N} E_i\big)^c$, it follows that $\mathcal{F}_{E_1}$ is a recoverable body of information.
Using that $\mathcal{F}_{E_1} \subset \mathcal{F}_{E_1,E_2}$, it follows that $\mathcal{F}_{E_1}$ is not maximal.

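Exercise 11.10.6 This is an application of Theorem C.1.3. It will be proved
by double inclusion. Let $\mathcal{P} = \{\bigcap_{i=1}^{k}\{\omega;\ X_i(\omega) \le x_i\},\ k \ge 1,\ x_i \in \mathbb{R}\}$, which is
a p-system (closed to intersections). For $k = 1$, we have $\{\omega;\ X_i(\omega) \le x_i\} \in \mathcal{P}$,
which implies $\mathfrak{S}(\{\omega;\ X_i(\omega) \le x_i\}) \subset \mathfrak{S}(\mathcal{P})$, or equivalently $\mathfrak{S}(X_i) \subset \mathfrak{S}(\mathcal{P})$.
Taking unions yields $\bigcup_i\mathfrak{S}(X_i) \subset \mathfrak{S}(\mathcal{P})$, and taking sigma-algebras we get
$\mathfrak{S}\big(\bigcup_i\mathfrak{S}(X_i)\big) \subset \mathfrak{S}(\mathcal{P})$, which means the inclusion $\mathfrak{S}(X) \subset \mathfrak{S}(\mathcal{P})$. Next, we
show the opposite inclusion. Since $\mathcal{P}$ is included in $\bigvee_i\mathfrak{S}(X_i)$, we also have
$\mathcal{P} \subset \mathfrak{S}\big(\bigcup_{i=1}^{n}\mathfrak{S}(X_i)\big) = \mathcal{D}$. As a p-system, $\mathcal{P}$, included into a d-system, $\mathcal{D}$,
by Theorem C.1.3 from Appendix, we obtain

$$\mathfrak{S}(\mathcal{P}) \subset \mathfrak{S}\Big(\bigcup_{i=1}^{n}\mathfrak{S}(X_i)\Big) = \mathfrak{S}(X),$$

which proves the opposite inclusion.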

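Exercise 11.10.7 (a) Since $\tilde{Y}$ is determined by $Y$, we have $\tilde{Y} = f(Y)$. Then,
by Proposition D.5.1 of Appendix, we have $\mathfrak{S}(\tilde{Y}) \subset \mathfrak{S}(Y)$, or $\tilde{\mathcal{E}} \subset \mathcal{E}$. Then
$\mathcal{I}\setminus\mathcal{E} \subset \mathcal{I}\setminus\tilde{\mathcal{E}}$. Taking the generating sigma-algebra on both sides yields $\mathcal{L} \subset \tilde{\mathcal{L}}$.
(b) If a neuron is dropped from the $\ell$th layer, a verbatim argument shows
that $\mathcal{L}^{(\ell)} \subset \tilde{\mathcal{L}}^{(\ell)}$. To conclude, dropping out neurons leads to information loss
in the layer where the dropping occurs.

Exercise 11.10.8 (a) In this case $Y$ depends on $\tilde{Y}$, so $\mathfrak{S}(Y) \subset \mathfrak{S}(\tilde{Y})$. A proof
similar to the one used in Exercise 11.10.7 yields the inclusion $\tilde{\mathcal{L}} \subset \mathcal{L}$.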

Exercise 11.10.9 In case a, one unit in the hidden layer would suffice to
classify the points in each of the half-planes. In case b, we need three units




in the hidden layer, each unit corresponding to a side of the triangle. Each
unit learns a half-plane and a triangle can be written as the intersection of
three half-planes. Similarly, in case c we need 4 neurons. Using the activation
function $H(x)$ leads to sharp corners for the triangle and rectangle, while
using a sigmoid function leads to rounded corners.

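Exercise 11.10.10 It suffices to show that both parts are equal to the sigma-field
$\mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$. This will be done by double inclusion.

We show first $(\mathcal{F}\vee\mathcal{G})\vee\mathcal{H} \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$. Since $\mathcal{H} \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$ and
$\mathcal{F}\cup\mathcal{G} \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$, it follows that $\mathfrak{S}(\mathcal{F}\cup\mathcal{G}) \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$, and then
$\mathfrak{S}(\mathcal{F}\cup\mathcal{G})\cup\mathcal{H} \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$. Taking the sigma operator on both sides yields
$\mathfrak{S}\big(\mathfrak{S}(\mathcal{F}\cup\mathcal{G})\cup\mathcal{H}\big) \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$, which is $(\mathcal{F}\vee\mathcal{G})\vee\mathcal{H} \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H})$.

We show now $\mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H}) \subset (\mathcal{F}\vee\mathcal{G})\vee\mathcal{H}$. Starting from $\mathcal{F}\cup\mathcal{G} \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G})$,
we have $\mathcal{F}\cup\mathcal{G}\cup\mathcal{H} \subset \mathfrak{S}(\mathcal{F}\cup\mathcal{G})\cup\mathcal{H}$. Taking the sigma operator we obtain
$\mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H}) \subset \mathfrak{S}\big(\mathfrak{S}(\mathcal{F}\cup\mathcal{G})\cup\mathcal{H}\big)$, which is $\mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H}) \subset (\mathcal{F}\vee\mathcal{G})\vee\mathcal{H}$. Hence,
$\mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H}) = (\mathcal{F}\vee\mathcal{G})\vee\mathcal{H}$. Similarly, we can show $\mathfrak{S}(\mathcal{F}\cup\mathcal{G}\cup\mathcal{H}) = \mathcal{F}\vee(\mathcal{G}\vee\mathcal{H})$.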

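Exercise 11.10.11 Assume the neurons are of sigmoid type. We have

$$\begin{aligned}
x_1^{(1)} &= \sigma\big(w_{11}x_1^{(0)} + b_1\big)\\
x_2^{(1)} &= \sigma\big(w_{12}x_1^{(0)} + w_{22}x_2^{(0)} + b_2\big)\\
x_3^{(1)} &= \sigma\big(w_{23}x_2^{(0)} + b_3\big).
\end{aligned}$$

Since $X^{(1)}$ depends on $X^{(0)}$, we obtain $\mathcal{I}^{(1)} \subset \mathcal{I}^{(0)}$. We can also solve for
$X^{(0)}$ in terms of $X^{(1)}$, which implies $\mathcal{I}^{(0)} \subset \mathcal{I}^{(1)}$.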

Exercise 11.10.12 Similar to the solution of Exercise 11.10.11. 

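Exercise 11.10.13 If $\mathcal{I} = \mathcal{E}$ then obviously $\mathcal{I}\setminus\mathcal{E} = \emptyset$, so $\mathcal{L} = \mathfrak{S}(\mathcal{I}\setminus\mathcal{E}) = \mathfrak{S}(\{\emptyset\}) = \{\emptyset, \Omega\}$. Conversely, if $\mathcal{L} = \{\emptyset, \Omega\}$, then $\mathfrak{S}(\mathcal{I}\setminus\mathcal{E}) = \{\emptyset, \Omega\}$, so
either $\mathcal{I}\setminus\mathcal{E} = \{\emptyset\}$, or $\mathcal{I}\setminus\mathcal{E} = \{\Omega\}$, or $\mathcal{I}\setminus\mathcal{E} = \{\emptyset, \Omega\}$. The first case implies
$\mathcal{I} = \mathcal{E}$. The other two cases lead to contradiction, since $\Omega$ is subtracted with
$\mathcal{E}$ and cannot belong to $\mathcal{I}\setminus\mathcal{E}$.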

Exercise 11.10.14 Assume, by contradiction, that $\mathcal{H} \subset \mathcal{E} \subset \mathcal{I}$. Using Proposition
11.4.5 part (a), the first inclusion yields that $\mathcal{I} = \mathcal{L}$. Similarly, Proposition
11.4.5 part (b) applied to the second inclusion implies $\mathcal{I} \ne \mathcal{L}$. The two
statements obtained are contradictory.

Exercise 11.10.15 It is a consequence of Exercise 11.10.14. 

Exercise 11.10.16 Let U CI X~ be a subheld of a recoverable information field. 
Then X\X C X\U. Taking the ©-operator yields X — G(X\X) C It 

follows that the inclusion is in fact equality. 




Chapter 12 

Exercise 12.13.1 Use the inequality

$$H(X) = \int_e^\infty \frac{1}{x(\ln x)^2}\,(\ln x + 2\ln\ln x)\,dx \ge \int_e^\infty \frac{1}{x\ln x}\,dx = +\infty.$$


Exercise 12.13.2 (a) Let $\mu = \mathbb{E}[X]$, $\sigma^2 = Var(X)$ and consider the normal
density $q(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Then use the inequality

$$H(X) = -\int_{\mathbb{R}} p(x)\ln p(x)\,dx \le -\int_{\mathbb{R}} p(x)\ln q(x)\,dx = \frac{1}{2}\ln(2\pi e\sigma^2).$$

(b) It follows from

$$-H(X) = \int_{\mathbb{R}} p(x)\ln p(x)\,dx \le \ln M\int_{\mathbb{R}} p(x)\,dx = \ln M.$$


Exercise 12.13.3 (a) We have

$$\begin{aligned}
I(X,Y|Z) &= -\int p(x,y,z)\ln p(x|z)\,dx\,dy\,dz - \int p(x,y,z)\ln p(y|z)\,dx\,dy\,dz\\
&\quad + \int p(x,y,z)\ln p(x,y|z)\,dx\,dy\,dz\\
&= \int p(x,y,z)\ln\frac{p(x,y|z)}{p(x|z)\,p(y|z)}\,dx\,dy\,dz\\
&= D_{KL}\big[p(x,y|z)\,\|\,p(x|z)\,p(y|z)\big].
\end{aligned}$$

(b) Use that the Kullback-Leibler divergence is nonnegative and the fact that

$$D_{KL}\big[p(x,y|z)\,\|\,p(x|z)\,p(y|z)\big] = 0$$

if and only if $p(x,y|z) = p(x|z)\,p(y|z)$, i.e., when $X$ and $Y$ are independent,
given $Z$.

Exercise 12.13.4 It suffices to prove that the system cannot have
two distinct solutions. Assume there are two distinct solutions, $X_1$ and $X_2$,
and consider their difference $X_0 = X_1 - X_2 \ne 0$. Then $AX_0 = AX_1 - AX_2 = b - b = 0$,
so $X_0$ is a nonzero solution for the homogeneous system $AX = 0$.
For any $\alpha \in \mathbb{R}$, $X = \alpha X_0$ is also a solution of the homogeneous system, since
$AX = \alpha AX_0 = 0$; hence the space of solutions

$$S = \{X \in \mathbb{R}^n;\ AX = 0\}$$












is at least one-dimensional. On the other hand, a well-known result (based on
the isomorphism theorem) states that $\dim S = n - rank(A) = n - n = 0$,
which leads to a contradiction. It follows that $X_0 = 0$, i.e., $X_1 = X_2$.

Exercise 12.13.5 If $m = n$ then $Q$ is invertible, then $p(x) = Q^{-1}p(y) = Q^{-1}\,softmax(Q^{-1}h)$.

Exercise 12.13.6 Using that the natural logarithmic function is increasing,
we have $\ln\big(\sum_i e^{u_i}\big) \ge \ln(e^{u_j}) = u_j$. Hence, $\ln\big(\sum_i e^{u_i}\big) \ge \max_j u_j$. Then
choose $u_j = \big((Q^TQ)^{-1}Q^Th\big)_j$.
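
The inequality $\ln\big(\sum_i e^{u_i}\big) \ge \max_j u_j$ is easy to observe numerically. The sketch below uses the usual shift by the maximum for numerical stability; the function name and the sample vector are illustrative, and the complementary upper bound $\max_j u_j + \ln n$ is included only as a standard companion fact.

```python
import math

def log_sum_exp(u):
    """ln(sum_i exp(u_i)), shifted by max(u) to avoid overflow."""
    m = max(u)
    return m + math.log(sum(math.exp(ui - m) for ui in u))

u = [-1.0, 0.5, 2.0, 1.9]       # illustrative values
lse = log_sum_exp(u)
print(max(u), lse, max(u) + math.log(len(u)))
```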

Exercise 12.13.8 For (a) and (b) apply the formula for neural manifold
dimension. (c) The one at (b) has a larger capacity because it has more parameters.

Exercise 12.13.9 (a) increases, (b) decreases, (c) decreases.

Exercise 12.13.10 (a) Using an analog of relation (12.11.34) for the one-hidden
layer network, applied to the first and second layers, and then to the
second and third layers, we have, respectively,

$$\mathfrak{S}(X) = \mathfrak{S}(U, X_{101}, \ldots, X_{784}), \qquad \mathfrak{S}(U) = \mathfrak{S}(Y, U_{11}, \ldots, U_{100}).$$

Concatenating yields

$$\mathfrak{S}(X) = \mathfrak{S}(Y, U_{11}, \ldots, U_{100}, X_{101}, \ldots, X_{784}). \qquad (2)$$

(b) From relation (12.11.34) applied to the zero-hidden layer network, with
input $X$ and output $Y$, we have

$$\mathfrak{S}(X) = \mathfrak{S}(Y, X_{11}, \ldots, X_{784}). \qquad (3)$$

Comparing formulas (3) and (2), the transitivity implies

$$\mathfrak{S}(Y, X_{11}, \ldots, X_{100}, X_{101}, \ldots, X_{784}) = \mathfrak{S}(Y, U_{11}, \ldots, U_{100}, X_{101}, \ldots, X_{784}). \qquad (4)$$

Exercise 12.13.11 The maximum entropy of the input variable $X$ is about
9.614 bits per image. The entropy of $X$, given $U$, is equal to the average
number of pixels of $X$ corresponding to each entry of $U$

$$2^{H(X|U)} = \frac{784}{100},$$

which implies $H(X|U) = 2.9708$ bits. Then the mutual information of $X$ and
$U$ is given by

$$I(X, U) = H(X) - H(X|U) = 9.614 - 2.9708 = 6.643,$$





namely, about 6.643 bits of information of each image are used toward each
entry of the hidden layer. From the previous analysis of the zero-hidden layer
network we have $I(X,Y) = 3.322$. These lead to the inequalities

$$H(X) > I(X,U) > I(X,Y).$$
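
The figures quoted in this solution can be reproduced under the assumption that they are base-2 logarithms of the counts involved: 784 input pixels, 100 hidden units, and 10 output classes. This reading is an inference from the numbers, and the variable names below are illustrative.

```python
import math

pixels, hidden, classes = 784, 100, 10   # counts assumed from the network sizes

h_x = math.log2(pixels)                  # maximum entropy of the input
h_x_given_u = math.log2(pixels / hidden) # average pixels per hidden entry, in bits
i_x_u = h_x - h_x_given_u                # equals log2(hidden)
i_x_y = math.log2(classes)

print(h_x, h_x_given_u, i_x_u, i_x_y)
```

The printed values agree with 9.614, 2.9708, 6.643 and 3.322 to about three decimal places, and they satisfy the chain $H(X) > I(X,U) > I(X,Y)$.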

Exercise 12.13.12 Using that $H(X_h|X_v) = H(X_h, X_v) - H(X_v)$ and
$H(X_v|X_h) = H(X_v, X_h) - H(X_h)$, it follows $H(X_h|X_v) = H(X_v|X_h)$ if and
only if $H(X_h) = H(X_v)$.

Exercise 12.13.14 From the invariance property, see Proposition 12.7.5, we
have $I(X'_h, X'_v) = I(X_h, X_v)$. Using formula $I(X,Y) = H(X) + H(Y) - H(X,Y)$,
it follows that $H(X'_h, X'_v) = H(X_h, X_v)$ if and only if

$$H(X'_h) + H(X'_v) = H(X_h) + H(X_v).$$

Exercise 12.13.16 (a) By the definition of mutual information, we have

$$\begin{aligned}
I(X, Y, Z) &= H(X) + H(Y) + H(Z) - H(X, Y, Z)\\
&= [H(X) + H(Y) - H(X, Y)] + [H(X, Y) + H(Z) - H(X, Y, Z)]\\
&= I(X,Y) + I((X,Y),Z).
\end{aligned}$$

(b) It can be shown similarly that for any $1 \le k < n$ we have

$$I(X_1, \ldots, X_n) = I(X_1, \ldots, X_k) + I\big((X_1, \ldots, X_k), (X_{k+1}, \ldots, X_n)\big).$$


Exercise 12.13.19 It follows from the definition of the determinant:

$$|\det W| = \Big|\sum \pm\, w_{1i_1}\cdots w_{ni_n}\Big| \le \sum |w_{1i_1}|\cdots|w_{ni_n}| \le n!\,c^n.$$

Exercise 12.13.20 Since

$$I(X_1, X_2) = H(X_1) + H(X_2) - H(X_1, X_2),$$

we need to compute the entropy terms on the right side. By Exercise 6.6.11 we have

$$H(X_1) = \frac{1}{2}\ln(2\pi) + \ln \sigma_1 + \frac{1}{2}, \qquad
H(X_2) = \frac{1}{2}\ln(2\pi) + \ln \sigma_2 + \frac{1}{2}.$$


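Using that the bivariate distribution of $(X_1, X_2)$ is given by

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
\exp\Big\{-\frac{1}{1-\rho^2}\Big[\frac{(x_1-\mu_1)^2}{2\sigma_1^2}
- \rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}
+ \frac{(x_2-\mu_2)^2}{2\sigma_2^2}\Big]\Big\},$$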

















the joint entropy term can be computed as 


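$$\begin{aligned}
H(X_1, X_2) &= -\iint f(x_1, x_2) \ln f(x_1, x_2)\, dx_1 dx_2 \\
&= \ln\big(2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}\big) \iint f(x_1, x_2)\, dx_1 dx_2 \\
&\quad + \frac{1}{1-\rho^2}\Big[\iint f(x_1,x_2)\frac{(x_1-\mu_1)^2}{2\sigma_1^2}\, dx_1dx_2
- \rho \iint f(x_1,x_2)\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\, dx_1dx_2 \\
&\qquad\qquad + \iint f(x_1,x_2)\frac{(x_2-\mu_2)^2}{2\sigma_2^2}\, dx_1dx_2\Big] \\
&= \ln\big(2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}\big)
+ \frac{1}{1-\rho^2}\Big[\frac{1}{2} - \frac{\rho}{\sigma_1\sigma_2}\, Cov(X_1, X_2) + \frac{1}{2}\Big] \\
&= 1 + \ln\big(2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}\big),
\end{aligned}$$

where the last step uses $Cov(X_1, X_2) = \rho\,\sigma_1\sigma_2$. Therefore, after cancellations, we obtain

$$I(X_1, X_2) = H(X_1) + H(X_2) - H(X_1, X_2) = -\ln\sqrt{1-\rho^2} = -\frac{1}{2}\ln(1-\rho^2).$$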

The mutual information depends explicitly on the correlation coefficient, p. 
Therefore, at least in the case of normal distributions, correlation is an expres- 
sion of the mutual information between the variables. 
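As a sanity check on the closed form $-\frac{1}{2}\ln(1-\rho^2)$, the following sketch evaluates the three Gaussian entropy terms directly and compares their combination with the formula. The function names are illustrative; the entropy expressions are the standard ones for univariate and bivariate normal distributions (in nats).

```python
import math

def gaussian_entropy_1d(sigma):
    # H(X) = (1/2) ln(2*pi) + ln(sigma) + 1/2
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5

def gaussian_entropy_2d(sigma1, sigma2, rho):
    # H(X1, X2) = 1 + ln(2*pi*sigma1*sigma2*sqrt(1 - rho^2))
    return 1 + math.log(2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - rho ** 2))

def mutual_information(sigma1, sigma2, rho):
    # I(X1, X2) = H(X1) + H(X2) - H(X1, X2)
    return (gaussian_entropy_1d(sigma1) + gaussian_entropy_1d(sigma2)
            - gaussian_entropy_2d(sigma1, sigma2, rho))

for rho in (0.0, 0.3, -0.8):
    print(rho, mutual_information(2.0, 0.5, rho), -0.5 * math.log(1 - rho ** 2))
```

The two printed columns agree for every $\rho$, and the result does not depend on $\sigma_1$ or $\sigma_2$, as the cancellation in the solution predicts.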

Exercise 12.13.21 (a) Applying the invariance property given by Proposition 12.7.5, we have

$$\begin{aligned}
I(X_1, X_2) &= I(F_{X_1}(X_1), F_{X_2}(X_2)) = I(U_1, U_2) \\
&= H(U_1) + H(U_2) - H(U_1, U_2) \\
&= \int_0^1\!\!\int_0^1 c(u_1, u_2) \ln c(u_1, u_2)\, du_1 du_2,
\end{aligned}$$

where we used that $H(U_i) = 0$, since $U_i$ are uniformly distributed on $[0,1]$.

(b) If $X_1$ and $X_2$ are independent, then the copula is $C(u_1, u_2) = u_1u_2$ and its density is $c(u_1, u_2) = 1$. Therefore, using (a), we obtain $I(X_1, X_2) = 0$.

Chapter 13 

Exercise 13.8.1 The dimension of the associated neural manifold is

$$r = 784 \times 200 + 200 \times 100 + 100 \times 50 + 50 \times 10 + 350 = 182{,}650.$$
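The count can be reproduced mechanically. The sketch below assumes a fully-connected 784-200-100-50-10 architecture, which is inferred from the products in the formula; the trailing term 350 is taken verbatim from the solution and is not derived here.

```python
# Layer widths inferred from the products 784x200, 200x100, 100x50, 50x10.
layers = [784, 200, 100, 50, 10]

# One weight for every pair of units in consecutive layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

# Additive term exactly as it appears in the solution's formula.
extra_term = 350

total = weights + extra_term
print(weights, total)
```

The weights alone contribute 182,300 parameters; adding the 350 from the formula gives 182,650.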


















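Exercise 13.8.2 The dimension of the neural manifold is

$$\phi_N(k) = d^{(0)}\frac{N}{k} + (k-1)\Big(\frac{N}{k}\Big)^2 + d^{(L)}\frac{N}{k} + N
= \big(d^{(0)} + d^{(L)}\big)\frac{N}{k} + \frac{k-1}{k^2}N^2 + N.$$

If we let $u = 1/k$, the above formula is written as

$$\phi_N(u) = \big(d^{(0)} + d^{(L)}\big)Nu + (u - u^2)N^2 + N
= -N^2u^2 + N\big(d^{(0)} + d^{(L)} + N\big)u + N.$$

This is a quadratic function in $u$, which achieves its maximum for

$$u = \frac{d^{(0)} + d^{(L)} + N}{2N}.$$

For $N$ large in comparison with $d^{(0)} + d^{(L)}$ we have $u \approx 1/2$, so the number of layers is $k = 2$.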

Exercise 13.8.3 The dimension of the neural manifold is $r = 784N + 10N = 794N$. Solving the inequality $794N > 550000$, we get $N > 692.7$, i.e., $N \ge 693$. Hence, for $N$ larger than 700 the networks exhibit overfitting effects.

Exercise 13.8.4 Solve the inequality $h^2 + 796h > 550{,}000$.
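Both inequalities in Exercises 13.8.3 and 13.8.4 can be solved numerically in a few lines. This is only a sketch: the threshold 550,000 and the coefficients are taken from the solutions above, and the variable names are illustrative.

```python
import math

threshold = 550_000

# Exercise 13.8.3: smallest integer N with 794 * N > threshold.
n_min = threshold // 794 + 1

# Exercise 13.8.4: positive root of h^2 + 796 h - threshold = 0,
# then the smallest integer h strictly above it.
root = (-796 + math.sqrt(796 ** 2 + 4 * threshold)) / 2
h_min = math.floor(root) + 1

print(n_min, round(root, 2), h_min)
```

The quadratic has one positive root, so the inequality holds exactly for integers $h$ at or above `h_min`.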

Exercise 13.8.5 Assume first that $u$ and $v$ are arbitrary vectors, so they can be decomposed as

$$u = \sum_{i=1}^r u^i e_i + u^N n, \qquad v = \sum_{j=1}^r v^j e_j + v^N n,$$

where $\{e_1, \dots, e_r\}$ is a basis in the tangent space $T_yS$ and $n$ is the unit normal to $S$ at $y$. Their Euclidean inner product is

$$\begin{aligned}
\langle u, v\rangle &= \Big\langle \sum_{i=1}^r u^i e_i, \sum_{j=1}^r v^j e_j \Big\rangle + u^N v^N \\
&= \sum_{i,j} u^i v^j \langle e_i, e_j\rangle + u^N v^N \\
&= \sum_{i,j} u^i v^j g_{ij} + u^N v^N.
\end{aligned}$$

If the vectors $u$ and $v$ are tangent to $S$, their normal parts are zero, so $u^N = v^N = 0$, and hence $\langle u, v\rangle = \sum_{i,j} u^i v^j g_{ij}$. Then use that $u$ and $v$ are orthogonal iff $\langle u, v\rangle = 0$.

Exercise 13.8.6 An affine subspace of $\mathbb{R}^n$ is a $k$-plane in $\mathbb{R}^n$, with $1 < k < n$. Any geodesic in this $k$-plane is a straight line. Also $L = 0$, because when a vector field $V$ is differentiated with respect to a vector field $U$, both $U$ and $V$ being in the given $k$-plane, the directional derivative, $\nabla_U V$, belongs to the same $k$-plane, so $L(U, V) = (\nabla_U V)^{\perp} = 0$.

Exercise 13.8.7 For any two vector fields, $U$ and $V$, tangent to $S$, we have the orthogonal decomposition $\nabla_U V = (\nabla_U V)^{\parallel} + (\nabla_U V)^{\perp}$. For any curve $c(s)$ in $S$, take $U = V = \dot{c}(s)$ and obtain the kinematic equation

$$\nabla_{\dot c(s)}\dot c(s) = \big(\nabla_{\dot c(s)}\dot c(s)\big)^{\parallel} + \big(\nabla_{\dot c(s)}\dot c(s)\big)^{\perp} = D_{\dot c(s)}\dot c(s) + L(\dot c(s), \dot c(s)),$$

where $\nabla$ and $D$ are the directional derivatives on $\mathcal{M}$ and $S$, respectively, and $L$ is the second fundamental form of $S$ with respect to $\mathcal{M}$. We use that a geodesic is a curve with zero acceleration.

(a) $\Rightarrow$ (b) Assume $L = 0$. Then $\nabla_{\dot c(s)}\dot c(s) = D_{\dot c(s)}\dot c(s)$. If $c(s)$ is a geodesic in $S$, then $D_{\dot c(s)}\dot c(s) = 0$. Then $\nabla_{\dot c(s)}\dot c(s) = 0$, so $c(s)$ is a geodesic in $\mathcal{M}$.

(b) $\Rightarrow$ (a) If $D_{\dot c(s)}\dot c(s) = 0$, then $\nabla_{\dot c(s)}\dot c(s) = 0$, and hence $L(\dot c, \dot c) = 0$, for any geodesic $c(s)$ in $S$. Since for any given vector $v \in T_pS$ there is a geodesic with initial velocity $\dot c(0) = v$, it follows that $L(v, v) = 0$, for all $v \in T_pS$, with $p$ arbitrary in $S$. Then use the polarization formula

$$L(v, w) = \frac{1}{2}\big[L(v + w, v + w) - L(v, v) - L(w, w)\big]$$

to get $L = 0$.

Exercise 13.8.8 (a) It follows from the change of variables formula in an integral. By the chain rule, $\gamma'(t) = \dot c(s)\phi'(t)$ with $s = \phi(t)$, so taking the norm, and using that $\phi$ is increasing, we get $\|\gamma'(t)\| = \|\dot c(s)\|\,\phi'(t)$. Then

$$L(\gamma) = \int \|\gamma'(t)\|\, dt = \int \|\dot c(\phi(t))\|\, \frac{ds}{dt}\, dt = \int_a^b \|\dot c(s)\|\, ds = L(c),$$

where the $t$-integrals are taken over the parameter interval of $\gamma$. Similar computation for the energy.

(b) Apply the integral version of Cauchy's inequality

$$\Big(\int_a^b f(x)g(x)\, dx\Big)^2 \le \int_a^b f(x)^2\, dx \int_a^b g(x)^2\, dx$$

for $f = \|\dot c\|$ and $g = 1$. The identity is reached for constant speed curves, that is, $\|\dot c(s)\| = $ constant.

(c) Locally, length minimizing curves and energy minimizing curves are equivalent. A geodesic is also an energy minimizing curve; along a geodesic the acceleration is zero, and hence the magnitude of the velocity is constant.


















Exercise 13.8.9 (a) Let $(r(t), z(t))$ be a plane curve with $r(t) > 0$. If the curve is rotated about the $z$-axis, a surface of revolution is obtained. This is parametrized by

$$\phi(t, s) = (r(t)\cos s,\ r(t)\sin s,\ z(t)), \qquad 0 \le s < 2\pi,$$

where $t$ measures the position on the given curve and $s$ measures the rotation angle. It can be shown (see the book [85], for instance) that the coefficients of the second fundamental form of a revolution surface are given by

$$L_{ij} = \frac{1}{\sqrt{\dot r^2 + \dot z^2}}
\begin{pmatrix} \dot r \ddot z - \dot z \ddot r & 0 \\ 0 & r \dot z \end{pmatrix}.$$

We shall apply this to two parametrizations of the unit sphere.

(i) Choose $r(t) = \cos t$, $z(t) = \sin t$, $-\pi/2 < t < \pi/2$ and obtain the sphere parametrization

$$\phi(t, s) = (\cos t\cos s,\ \cos t \sin s,\ \sin t), \qquad 0 \le s < 2\pi,\ -\pi/2 < t < \pi/2.$$

Using the previous formula yields $L_{ij} = \begin{pmatrix} 1 & 0 \\ 0 & \cos^2 t \end{pmatrix}$. The norm of $L$ is its largest eigenvalue, so $\|L\| = 1$.

(ii) Let $r(t) = \sqrt{1 - t^2}$, $z(t) = t$, $-1 < t < 1$. The parametrization is

$$\phi(t, s) = (\sqrt{1 - t^2}\cos s,\ \sqrt{1 - t^2}\sin s,\ t), \qquad 0 \le s < 2\pi,\ -1 < t < 1.$$

A computation shows that $L_{ij} = \begin{pmatrix} \frac{1}{1-t^2} & 0 \\ 0 & 1 - t^2 \end{pmatrix}$. The norm is, again, $\|L\| = 1$.

We make the remark that even if the coefficients are dependent on the parametrization, the norm $\|L\|$ is not.

Exercise 13.8.10 The outputs of the sigmoid neurons are $y_1 = \sigma(w_1x + b_1)$ and $y_2 = \sigma(w_2x + b_2)$. Their combination has the output

$$y = \lambda y_1 + (1 - \lambda)y_2 = \lambda\sigma(w_1x + b_1) + (1 - \lambda)\sigma(w_2x + b_2).$$

Since $(w_1, w_2, b_1, b_2, \lambda) \in \mathbb{R}^5$, the associated neural manifold has dimension 5.

Exercise 13.8.11 When a neuron is dropped, all its incoming and outgoing weights, as well as its bias, are equal to zero. Hence, the dimension of the neural manifold decreases. Having fewer neurons in the network produces an outcome which is less nonlinear. Hence, the neural manifold tends to have a smaller embedded curvature. Both effects reduce overfitting, and if too many neurons are dropped, it may lead to an underfit.


















Exercise 13.8.12 All formulas are applications of the product rule and sym- 
metry of the second derivative. 


Chapter 14 

Exercise 14.13.1 (a) Consider the expansion $e_i = \sum_{j=1}^n a_{ij}v_j$. Then

$$e_i^T G e_k = \Big(\sum_{j=1}^n a_{ij}v_j\Big)^T G \Big(\sum_{r=1}^n a_{kr}v_r\Big) = \sum_{j,r} a_{ij}a_{kr}\, v_j^T G v_r = 0,$$

so $G = O_n$. (b) It follows from using linearity, part (a) and considering $G = A - B$.

Exercise 14.13.2 Let $Y_1 = u^TX$ and $Y_2 = v^TX$. The joint cumulative distribution function is

$$\begin{aligned}
F_{Y_1Y_2}(a, b) &= P(Y_1 \le a, Y_2 \le b) = P(u^TX \le a,\ v^TX \le b) \\
&= P(u_1X_1 + u_2X_2 \le a,\ v_1X_1 + v_2X_2 \le b) \\
&= \iint_{\{u^Tx \le a\} \cap \{v^Tx \le b\}} p_{X_1X_2}(x_1, x_2)\, dx_1dx_2.
\end{aligned}$$

We compute the ratio

$$\frac{F_{Y_1Y_2}(a + \Delta a, b + \Delta b) - F_{Y_1Y_2}(a + \Delta a, b) - F_{Y_1Y_2}(a, b + \Delta b) + F_{Y_1Y_2}(a, b)}{\Delta a\,\Delta b}
= \frac{1}{\Delta a\,\Delta b}\iint_{D_{ab}} p_{X_1X_2}(x_1, x_2)\, dx_1dx_2,$$

where $D_{ab} = \{a < u^Tx \le a + \Delta a\} \cap \{b < v^Tx \le b + \Delta b\}$ is a rectangular region with side directions parallel to $u$ and $v$. Using that $R^{-1}(D_{ab}) = [a, a + \Delta a] \times [b, b + \Delta b]$, changing variables, using Fubini's Theorem and the fact that $p_{X_1X_2}$ is rotationally invariant, we obtain

$$\begin{aligned}
\frac{1}{\Delta a\,\Delta b}\iint_{D_{ab}} p_{X_1X_2}(x_1, x_2)\, dx_1dx_2
&= \frac{1}{\Delta a\,\Delta b}\iint_{[a,a+\Delta a]\times[b,b+\Delta b]} p_{X_1X_2}(R(x_1, x_2))\, |\det R|\, dx_1dx_2 \\
&= \frac{1}{\Delta a}\int_a^{a+\Delta a} p_{X_1}(x_1)\, dx_1 \cdot \frac{1}{\Delta b}\int_b^{b+\Delta b} p_{X_2}(x_2)\, dx_2.
\end{aligned}$$

Taking the limits $\Delta a \to 0$ and $\Delta b \to 0$, we obtain

$$\frac{\partial^2}{\partial a\,\partial b}F_{Y_1Y_2}(a, b) = p_{X_1}(a)\,p_{X_2}(b),$$

or $f_{Y_1Y_2}(a, b) = f_1(a)f_2(b)$, which implies that $Y_1$ and $Y_2$ are independent.

(b) $Y_1 = u^TX$ is normally distributed with zero mean and $Var(Y_1) = \|u\|^2 = 1$. Hence, $u^TX, v^TX \sim \mathcal{N}(0, 1)$.

(c) It follows from the trigonometric circle. We take $\phi = \arg u$ and since $u$ and $v$ are orthogonal, $\arg v = \phi + \pi/2$. Then $v = (\cos(\phi + \pi/2), \sin(\phi + \pi/2)) = (-\sin\phi, \cos\phi)$.

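Exercise 14.13.3 (a) We have

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} u^T \\ v^T \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = MX.$$

By Exercise 14.13.2, $M$ is a rotation matrix and it preserves independence.

(b) If $Y_1 = u^TX$ and $Y_2 = v^TX$ are independent, then $Y'_1 = \alpha u^TX$ and $Y'_2 = \beta v^TX$ are also independent, for nonzero constants $\alpha, \beta$.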

Exercise 14.13.4 We expand the determinant $D(\lambda) = \det(M - \lambda I_n)$ over the first row and then expand each of the minors over the first column. The first minor, which multiplies $-\lambda$, is diagonal with all diagonal entries equal to $-\lambda$. Each minor multiplying $a_j$ has first column $(b_1, \dots, b_n)^T$ and is otherwise diagonal in $-\lambda$ with one column removed, so its expansion over the first column leaves the single term $\pm\, b_j(-\lambda)^{n-1}$. Collecting the signs, we obtain

$$\begin{aligned}
D(\lambda) &= -\lambda(-\lambda)^n - a_1b_1(-\lambda)^{n-1} - a_2b_2(-\lambda)^{n-1} - \cdots + (-1)^{2n+1}a_nb_n(-\lambda)^{n-1} \\
&= (-\lambda)^{n-1}\Big[\lambda^2 - \sum_{i=1}^n a_ib_i\Big] = (-\lambda)^{n-1}\big[\lambda^2 - a^Tb\big].
\end{aligned}$$

Solving the equation $D(\lambda) = 0$, we obtain the desired solutions.

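Exercise 14.13.5 (a) Use $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. (c) Let $\|\phi'\| = \sup_x |\phi'(x)|$. Then

$$0 < g_{00} = E_{p_X}\big[\phi'(w^TX + b)^2\big] \le \|\phi'\|^2\, E_{p_X}[1] = \|\phi'\|^2,$$

$$g_{0k} = \int x_k\, \phi'(w^Tx + b)^2\, p(x)\, dx = \int x_k\sqrt{p(x)}\ \phi'(w^Tx + b)^2\sqrt{p(x)}\, dx.$$

Then the Cauchy-Schwarz inequality implies

$$\begin{aligned}
g_{0k}^2 &\le \int \big(x_k\sqrt{p(x)}\big)^2 dx \int \big(\phi'(w^Tx + b)^2\sqrt{p(x)}\big)^2 dx \\
&= \int x_k^2\, p(x)\, dx \int \phi'(w^Tx + b)^4\, p(x)\, dx \le \|\phi'\|^4 \int x_k^2\, p(x)\, dx \\
&= \|\phi'\|^4\, E[X_k^2].
\end{aligned}$$

Similarly, we have

$$\begin{aligned}
g_{jk}^2 &\le \int \big(x_jx_k\sqrt{p(x)}\big)^2 dx \int \big(\phi'(w^Tx + b)^2\sqrt{p(x)}\big)^2 dx \\
&= \int x_j^2x_k^2\, p(x)\, dx \int \phi'(w^Tx + b)^4\, p(x)\, dx \le \|\phi'\|^4 \int x_j^2x_k^2\, p(x)\, dx \\
&= \|\phi'\|^4\, E[X_j^2X_k^2].
\end{aligned}$$

To obtain part (b) use that $\|\sigma'\| = \frac{1}{4}$.

Exercise 14.13.6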


$$\begin{aligned}
g_{00} &= E\big[\phi'(w^TX + b)^2\big] = \int_{[0,1]^n} \phi'(w^Tx + b)^2\, dx_1 \cdots dx_n \\
&= \frac{1}{w_1 \cdots w_n}\int_0^{w_1}\!\!\cdots\int_0^{w_n} \phi'(u_1 + \cdots + u_n + b)^2\, du_1 \cdots du_n.
\end{aligned}$$

Similar computation applies to other coefficients.


Exercise 14.13.8

$$g_{\alpha_j\alpha_k} = E\big[\phi(w_jX + b_j)\,\phi(w_kX + b_k)\big] = \frac{1}{\sqrt{2\pi}}\int \phi(w_jx + b_j)\,\phi(w_kx + b_k)\, e^{-\frac{1}{2}x^2}\, dx.$$

Similar for other coefficients.


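Exercise 14.13.10 Taking the logarithm in the relation

$$p_{X_1,X_2}(x_1, x_2; \theta) = p_{X_1}(x_1; \theta)\, p_{X_2|X_1}(x_2|x_1; \theta)$$

we get $\ell_{X_1,X_2}(\theta) = \ell_{X_1}(\theta) + \ell_{X_2|X_1}(\theta)$. Differentiating, then multiplying by $p_{X_1,X_2}(x_1, x_2; \theta)$ and integrating in $x_1$ and $x_2$, we obtain

$$E_{p_{X_1,X_2}}\big[\partial_{\theta_i}\partial_{\theta_j}\ell_{X_1,X_2}(\theta)\big] = E_{p_{X_1}}\big[\partial_{\theta_i}\partial_{\theta_j}\ell_{X_1}(\theta)\big] + E_{p_{X_1,X_2}}\big[\partial_{\theta_i}\partial_{\theta_j}\ell_{X_2|X_1}(\theta)\big],$$

where we used $\int p_{X_2|X_1}(x_2|x_1; \theta)\, dx_2 = 1$. This means

$$g_{ij}(X_1, X_2; \theta) = g_{ij}(X_1; \theta) + g_{ij}(X_2|X_1; \theta).$$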

















Exercise 14.13.11 (a) It follows from the definition of the Euclidean scalar product.

(b) Using the definition of the gradient, we have

$$g(\nabla_g f, X) = \sum_{i,j} g_{ij}(\nabla_g f)^i X^j = \sum_{i,j,k} g_{ij}\, g^{ik}\frac{\partial f}{\partial \theta_k}X^j = \sum_{k=1}^N X^k\frac{\partial f}{\partial \theta_k}.$$

(c) It follows from (a) and (b).
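Exercise 14.13.12 (a) We have

$$\begin{aligned}
\|\nabla_g f\|_g^2 &= g(\nabla_g f, \nabla_g f) = \sum_{i,j} g_{ij}(\nabla_g f)^i(\nabla_g f)^j \\
&= \sum_{i,j}\sum_{k,p} g_{ij}\, g^{ik}\frac{\partial f}{\partial \theta_k}\, g^{jp}\frac{\partial f}{\partial \theta_p}
= \sum_{k,p} g^{kp}\frac{\partial f}{\partial \theta_k}\frac{\partial f}{\partial \theta_p} \\
&= (\nabla_{Eu} f)^T g^{-1}(\theta)\, \nabla_{Eu} f.
\end{aligned}$$

(b) Similarly,

$$\|\nabla_{Eu} f\|_g^2 = \sum_{i,j} g_{ij}(\nabla_{Eu} f)^i(\nabla_{Eu} f)^j = (\nabla_{Eu} f)^T g(\theta)\, \nabla_{Eu} f.$$

(c) Let $(\nabla_{Eu} f)(p) = 0$. By part (a) we obtain $\|(\nabla_g f)(p)\|_g^2 = 0$ and since $g$ is nondegenerate, it follows that $(\nabla_g f)(p) = 0$. Conversely, assume now that $(\nabla_g f)(p) = 0$. Using the formula $\frac{\partial f}{\partial \theta_p} = \sum_j g_{jp}(\nabla_g f)^j$ it follows that all partial derivatives of $f$ vanish at $p$, and hence $(\nabla_{Eu} f)(p) = 0$.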


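Exercise 14.13.13 (a) $E[\hat\mu] = \frac{1}{n}\sum_{j=1}^n E[X_j] = \frac{n\mu}{n} = \mu$.

(b) Since $p_\mu(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2}}$, then $\partial_\mu^2 \ln p_\mu = -1$, so $I(\mu) = -E[\partial_\mu^2 \ln p_\mu] = 1$.

(c) The Fisher information contained in $n$ independent random variables, $X_1, \dots, X_n$, is the sum

$$I(X_1, \dots, X_n; \mu) = \sum_{j=1}^n I(X_j; \mu) = n.$$

Then $Var(\hat\mu) = \frac{1}{n^2}\sum_j Var(X_j) = \frac{1}{n^2}\, n\, Var(X_1) = \frac{1}{n} = \frac{1}{I(X_1, \dots, X_n; \mu)}$, i.e., the Cramér-Rao bound is achieved.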

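Exercise 14.13.14 Since we know that $\lambda = E[X]$, it is natural to consider the estimator $\hat\lambda = \frac{1}{n}\sum_{j=1}^n X_j$, with $X_j \sim Pois(\lambda)$ independent random variables. The Fisher information contained in $n$ independent Poisson distributed random variables, $X_1, \dots, X_n$, is

$$\begin{aligned}
I(X_1, \dots, X_n; \lambda) &= \sum_{j=1}^n I(X_j; \lambda) = nI(X; \lambda) = -nE\big[\partial_\lambda^2 \ln p_\lambda(X)\big] \\
&= -n\sum_{k \ge 0}\frac{\lambda^k e^{-\lambda}}{k!}\,\partial_\lambda^2(k\ln\lambda - \lambda)
= n e^{-\lambda}\frac{1}{\lambda}\sum_{k \ge 1}\frac{\lambda^{k-1}}{(k-1)!} \\
&= \frac{n}{\lambda}e^{-\lambda}e^{\lambda} = \frac{n}{\lambda}.
\end{aligned}$$

The estimator $\hat\lambda$ is unbiased, with variance

$$Var(\hat\lambda) = \frac{1}{n^2}\, n\, Var(X_1) = \frac{\lambda}{n} = \frac{1}{I(X_1, \dots, X_n; \lambda)}.$$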
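The series for the Poisson Fisher information can also be summed numerically. The sketch below evaluates $-E[\partial_\lambda^2 \ln p_\lambda(X)] = E[X]/\lambda^2$ term by term and compares it with $1/\lambda$; the truncation length and the function name are choices made for this illustration only.

```python
import math

def poisson_fisher_information(lam, terms=200):
    """Truncated sum of p_lam(k) * k / lam^2 over k = 0, 1, ..., terms - 1."""
    total = 0.0
    pmf = math.exp(-lam)           # P(X = 0)
    for k in range(terms):
        # -d^2/dlam^2 (k ln lam - lam) = k / lam^2
        total += pmf * k / lam ** 2
        pmf *= lam / (k + 1)       # recurrence P(X = k + 1) = P(X = k) * lam / (k + 1)
    return total

lam, n = 2.5, 10
print(poisson_fisher_information(lam), 1 / lam)
print(n * poisson_fisher_information(lam), n / lam)
```

Since $Var(\hat\lambda) = \lambda/n$, the second printed pair is exactly the reciprocal of the estimator's variance, which is the Cramér-Rao equality stated in the solution.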


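Exercise 14.13.15 Since $\bar X \sim \mathcal{N}(\mu, \frac{1}{N})$, then $p_{\bar X}(u; \mu) = \frac{\sqrt N}{\sqrt{2\pi}}e^{-\frac{N}{2}(u - \mu)^2}$, so the log-likelihood function becomes

$$\ell_{\bar X}(\mu) = \ln\frac{\sqrt N}{\sqrt{2\pi}} - \frac{N}{2}(u - \mu)^2.$$

The square of the score function is $(\partial_\mu \ell_{\bar X}(\mu))^2 = N^2(\bar X - \mu)^2$. Therefore, the Fisher information induced by $\bar X$ is

$$I(\bar X) = E_{p_{\bar X}}\big[(\partial_\mu \ell_{\bar X}(\mu))^2\big] = E_{p_{\bar X}}\big[N^2(\bar X - \mu)^2\big] = N^2\, Var(\bar X) = N^2\frac{1}{N} = N.$$

From Exercise 14.13.13, part (b), we have $I(X) = 1$. Hence, $I(\bar X) = N\, I(X)$.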

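Exercise 14.13.16 (a) Since $p_{X_1,X_2}(x_1, x_2; \theta) = p_{X_1}(x_1; \theta)\, p_{X_2}(x_2; \theta)$, taking the logarithmic function, we obtain $\ell_{X_1,X_2}(\theta) = \ell_{X_1}(\theta) + \ell_{X_2}(\theta)$. Differentiate to get

$$\partial_{\theta_i}\partial_{\theta_j}\ell_{X_1,X_2}(\theta) = \partial_{\theta_i}\partial_{\theta_j}\ell_{X_1}(\theta) + \partial_{\theta_i}\partial_{\theta_j}\ell_{X_2}(\theta).$$

Then multiplying by $p_{X_1,X_2}(x_1, x_2; \theta)$ and integrating with respect to $x_1$ and $x_2$, using $\int p_{X_1,X_2}(x_1, x_2; \theta)\, dx_1 = p_{X_2}(x_2; \theta)$ and $\int p_{X_1,X_2}(x_1, x_2; \theta)\, dx_2 = p_{X_1}(x_1; \theta)$, leads to

$$E_{p_{X_1,X_2}}\big[\partial_{\theta_i}\partial_{\theta_j}\ell_{X_1,X_2}(\theta)\big] = E_{p_{X_1}}\big[\partial_{\theta_i}\partial_{\theta_j}\ell_{X_1}(\theta)\big] + E_{p_{X_2}}\big[\partial_{\theta_i}\partial_{\theta_j}\ell_{X_2}(\theta)\big],$$

which is equivalent to

$$g_{ij}(X_1, X_2; \theta) = g_{ij}(X_1; \theta) + g_{ij}(X_2; \theta).$$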














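(b) If $X_1, \dots, X_N$ are independent random variables, then

$$g(X_1, \dots, X_N; \theta) = \sum_{j=1}^N g(X_j; \theta).$$

(c) If $X_1, \dots, X_N \sim X$ are i.i.d., by part (b) we have

$$g(X_1, \dots, X_N; \theta) = \sum_{j=1}^N g(X_j; \theta) = N g(X; \theta).$$

Therefore, the inverse is

$$g(X_1, \dots, X_N; \theta)^{-1} = \frac{1}{N}\, g(X; \theta)^{-1}.$$

(d) $\hat\theta(N)$ is an asymptotically efficient estimator for $\theta$ if for $N \to \infty$

$$Cov(\hat\theta(N)) = E\big[(\hat\theta(N) - \theta)(\hat\theta(N) - \theta)^T\big] \approx g(X_1, \dots, X_N; \theta)^{-1} = \frac{1}{N}\, g(X; \theta)^{-1}.$$

Multiplying by $N$ and taking the limit, we obtain

$$\lim_{N \to \infty} N\, E\big[(\hat\theta(N) - \theta)(\hat\theta(N) - \theta)^T\big] = g^{-1}(\theta).$$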


Chapter 15 

Exercise 15.6.1 (a) For any fixed index $k$, we have $\bigcap_i \mathcal{C}_i \subset \mathcal{C}_k$. Using the monotonicity property of $\mathfrak{S}$ yields $\mathfrak{S}\big(\bigcap_i \mathcal{C}_i\big) \subset \mathfrak{S}(\mathcal{C}_k)$. Since $k$ is arbitrary, it follows that $\mathfrak{S}\big(\bigcap_i \mathcal{C}_i\big) \subset \bigcap_k \mathfrak{S}(\mathcal{C}_k)$, which is the desired relation.

(b) The proof is similar, starting from $\bigcup_i \mathcal{C}_i \supset \mathcal{C}_k$.

Exercise 15.6.2 The set of participants forms the input data, $X$, into a neural network. The participants left after each round represent the layers of the network. There are $n - 1$ hidden layers and each of them implements a max-pooling method by which the number of units decreases to half.

Exercise 15.6.3 For the sake of simplicity, we shall assume that the input has been partitioned into two classes, $\{a_1, a_2, a_3\}$, $\{b_1, b_2, b_3\}$, with three elements each.


Figure 3: For Exercise 15.6.5: a. Switching a min-pooling with a max-pooling 
layer provides different outputs. b. Switching an average-pooling with a max- 
pooling layer provides different outputs. 


(a) It follows from

$$\max\{\max\{a_1, a_2, a_3\}, \max\{b_1, b_2, b_3\}\} = \max\{a_1, a_2, a_3, b_1, b_2, b_3\}.$$

(b) The average of the averages counts as only one average

$$\frac{1}{2}\Big(\frac{a_1 + a_2 + a_3}{3} + \frac{b_1 + b_2 + b_3}{3}\Big) = \frac{1}{6}(a_1 + a_2 + a_3 + b_1 + b_2 + b_3).$$

(c) Use the relation

$$\min\{\min\{a_1, a_2, a_3\}, \min\{b_1, b_2, b_3\}\} = \min\{a_1, a_2, a_3, b_1, b_2, b_3\}.$$
Exercise 15.6.5 It follows from the counterexamples provided in Fig. 3 a, b. 
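The pooling identities of Exercise 15.6.3 and the non-commutativity claimed in Exercise 15.6.5 are easy to check on small inputs. The numbers below are arbitrary illustrative values and are not the ones used in Fig. 3.

```python
# Two classes with three elements each, as in Exercise 15.6.3.
a = [3, -1, 4]
b = [1, 5, -9]

max_of_max = max(max(a), max(b))
avg_of_avg = (sum(a) / len(a) + sum(b) / len(b)) / 2
min_of_min = min(min(a), min(b))

# Exercise 15.6.5: swapping the order of two different poolings changes the output.
rows = [[1, 4], [3, 2]]
min_after_max = min(max(r) for r in rows)           # max-pool each row, then min-pool
max_after_min = max(min(r) for r in rows)           # min-pool each row, then max-pool
avg_after_max = sum(max(r) for r in rows) / 2       # max-pool each row, then average
max_after_avg = max(sum(r) / len(r) for r in rows)  # average each row, then max-pool

print(max_of_max, avg_of_avg, min_of_min)
print(min_after_max, max_after_min, avg_after_max, max_after_avg)
```

Note that the averaging identity relies on the two classes having the same size; with unequal classes the average of the averages is a weighted mean, not the global mean.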

Chapter 16 

Exercise 16.9.1 Slide the kernel and convolve to obtain 

Exercise 16.9.2 (a) Define $(T_a \circ y)_i = y_{i-a}$. Then the equivariance property in one dimension becomes $C(T_a \circ y) = T_a \circ C(y)$. The proof is very similar to the one in two dimensions. (b) Yes.
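The one-dimensional equivariance relation $C(T_a \circ y) = T_a \circ C(y)$ can be verified on a finite signal. The sketch below assumes circular (wrap-around) boundary handling so that the shift is well defined on a finite list; this boundary convention and the helper names `shift` and `conv` are assumptions made for the illustration.

```python
def shift(y, a):
    """Translation (T_a y)_i = y_{i-a}, with indices taken modulo len(y)."""
    n = len(y)
    return [y[(i - a) % n] for i in range(n)]

def conv(y, kernel):
    """Sliding-window sum C(y)_i = sum_j kernel[j] * y_{i+j}, circular boundary."""
    n = len(y)
    return [sum(k * y[(i + j) % n] for j, k in enumerate(kernel)) for i in range(n)]

y = [1, 2, 0, -1, 3]
kernel = [1, -2, 1]

for a in range(len(y)):
    assert conv(shift(y, a), kernel) == shift(conv(y, kernel), a)
print("C(T_a y) == T_a C(y) for all shifts a")
```

With zero padding instead of wrap-around, the identity still holds away from the boundary but fails for entries whose window touches the edge.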







Exercise 16.9.3 (a) Performing the convolution, we obtain

$$\begin{pmatrix} a_{i-1,j-1} & a_{i-1,j} & a_{i-1,j+1} \\ a_{i,j-1} & a_{i,j} & a_{i,j+1} \\ a_{i+1,j-1} & a_{i+1,j} & a_{i+1,j+1} \end{pmatrix} * K
= (a_{i-1,j-1} - a_{i+1,j-1}) + \cdots + (a_{i-1,j+1} - a_{i+1,j+1}),$$

the effect being to subtract rows $(i - 1)$ and $(i + 1)$.

(b) It is similar, the effect being that of subtracting columns.

Exercise 16.9.4 (a) The convolution with this kernel produces a moving average in 2-D, which yields a uniform blur given by

$$\frac{1}{9}(a_{11} + a_{12} + a_{13} + a_{21} + a_{22} + a_{23} + a_{31} + a_{32} + a_{33}).$$

(b) This is a blur which emphasizes the center most, then the midpoints of the sides, and then the corners. The sum of all weights is equal to 1.

Exercise 16.9.5 Performing the convolution, we obtain

$$\begin{pmatrix} a_{i-1,j-1} & a_{i-1,j} & a_{i-1,j+1} \\ a_{i,j-1} & a_{i,j} & a_{i,j+1} \\ a_{i+1,j-1} & a_{i+1,j} & a_{i+1,j+1} \end{pmatrix} *
\begin{pmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{pmatrix}
= 5a_{i,j} - (a_{i-1,j} + a_{i,j-1} + a_{i,j+1} + a_{i+1,j}).$$

The contrast results from the difference between 5 times the central pixel activation and the vertical and horizontal neighboring pixel activations.

Exercise 16.9.7 The kernel is $K = \begin{pmatrix} 0 & 0 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}$. The convolution with $K$ of a $3 \times 3$ matrix is $a_{i,j-1} - a_{i,j}$. We can also consider the equivalent kernel $K = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{pmatrix}$.

Exercise 16.9.8 (a) Using the equivariance of convolution and invariance of pooling to translation, we have

$$\mathcal{P} \circ C(T_{a,b} \circ y) = \mathcal{P} \circ T_{a,b}(C \circ y) = \mathcal{P} \circ C(y).$$

(b) Similar properties used in reverse order.

Exercise 16.9.9 Just in the FC layer. The other layers are already sparse 
enough and a dropout would not provide substantial improvements. 





Figure 4: For Exercise 16.9.11. 

Exercise 16.9.10 Being sparse, a CNN has fewer weights and hence a smaller 
capacity than a fully-connected neural network. 

Exercise 16.9.11 (i) We have $N - F = kS$, where $k$ is the number of steps the kernel is moved from one end to the other. The size of the output is one more than $k$, i.e., $O = k + 1$. Solving for $k$ yields $O = \frac{N - F}{S} + 1$. In the case of padding, we add $2P$ to the dimension $N$ (one $P$ for each side) and obtain the second desired formula, $O = \frac{N + 2P - F}{S} + 1$. (ii) We apply the same procedure as in part (i). See Fig. 4.
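The output-size formula translates directly into a small helper. In this sketch the function name and the decision to raise an error when the stride does not divide $N + 2P - F$ are illustrative choices; deep learning libraries typically round down instead.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output length O = (N + 2P - F) / S + 1 for input N, kernel F, stride S, padding P."""
    span = n + 2 * padding - f
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    if span % stride != 0:
        raise ValueError("stride does not evenly divide N + 2P - F")
    return span // stride + 1

print(conv_output_size(28, 5))                       # no padding, unit stride
print(conv_output_size(32, 3, stride=1, padding=1))  # padding that preserves the size
print(conv_output_size(7, 3, stride=2))              # stride 2
```

For a two-dimensional input the same function would be applied to each spatial dimension separately.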

Exercise 16.9.12 (i) Since $H$ is a subgroup of $G$, we have $e \in H$, and then $x = xe \in xH$. (ii) Assume $xH \cap yH \neq \emptyset$ and we shall show that $xH = yH$ using double inclusion. There is $z \in xH \cap yH$. So, there are $h_1, h_2 \in H$ such that $z = xh_1 = yh_2$. This implies $x = yh_2h_1^{-1}$. Since $h_2h_1^{-1} \in H$, it follows that $x \in yH$, and from here we get $xH \subset yH$. Similarly, $y = xh_1h_2^{-1} \in xH$, and then $yH \subset xH$. (iii) The function $\varphi : H \to xH$ defined by $\varphi(h) = xh$ is a bijection, and hence $H$ and $xH$ have the same number of elements. (iv) (a) Since $x^{-1}x = e \in H$, then $x \sim x$. (b) If $x \sim y$, then $y^{-1}x \in H$, and its inverse $(y^{-1}x)^{-1} \in H$, which becomes $x^{-1}y \in H$, i.e., $y \sim x$. (c) If $x \sim y$ and $y \sim z$ then $y^{-1}x \in H$ and $z^{-1}y \in H$. Since $H$ is closed under multiplication, we have $z^{-1}x = (z^{-1}y)(y^{-1}x) \in H$, which means $x \sim z$.

Chapter 17 

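Exercise 17.10.1 (a) It can be shown by double inclusion. The inclusion "$\subset$" follows from applying the $\mathfrak{S}$-operator to the inclusion

$$\mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3 \subset \mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2) \cup \mathcal{G}_3.$$

The inclusion "$\supset$" follows from the following. First, since $\mathcal{G}_3 \subset \mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3$, then $\mathcal{G}_3 \subset \mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3)$. Similarly, from $\mathcal{G}_1 \cup \mathcal{G}_2 \subset \mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3$ it follows

$$\mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2) \subset \mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3).$$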

























































Figure 5: A “2-to-l” RNN for Exercise 17.10.3. 


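From the last two inclusions we get

$$\mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2) \cup \mathcal{G}_3 \subset \mathfrak{S}(\mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3).$$

Taking the $\mathfrak{S}$-operator on both sides we obtain the desired inclusion.

(b) We can show inductively that

$$\mathfrak{S}(\mathcal{G}_1 \cup \cdots \cup \mathcal{G}_n) = \mathfrak{S}\big(\mathfrak{S}(\mathcal{G}_1 \cup \cdots \cup \mathcal{G}_{n-1}) \cup \mathcal{G}_n\big),$$

$$\mathfrak{S}(\mathcal{G}_1 \cup \cdots \cup \mathcal{G}_n) = \mathfrak{S}\big(\mathfrak{S}(\mathcal{G}_1 \cup \cdots \cup \mathcal{G}_{n-k}) \cup \mathcal{G}_{n-k+1} \cup \cdots \cup \mathcal{G}_n\big).$$

Exercise 17.10.2 (a) The eigenvalues of $W$ are $\lambda_1 = 1/5$ and $\lambda_2 = -1/2$. Since both are less than 1 in absolute value, apply Proposition G.1.2. (b) Since $\rho(W) < 1$, apply Proposition G.1.2.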

Exercise 17.10.3 Hint: Follow the example presented in section 17.5 for 
Fig. 5. 

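Exercise 17.10.4 From $Y = Vh_2 + c$, we get $\mathfrak{S}(Y) \subset \mathfrak{S}(h_2)$. From

$$h_2 = \tanh(Wh_1 + UX_2 + b), \qquad h_1 = \tanh(Wh_0 + UX_1 + b)$$

we can show that $\mathfrak{S}(h_2) \subset \mathfrak{S}(X_1, X_2)$. Then $\mathfrak{S}(Y) \subset \mathfrak{S}(X_1, X_2)$. So, if $\mathfrak{S}(Z) \subset \mathfrak{S}(Y)$ then $\mathfrak{S}(Z) \subset \mathfrak{S}(X_1 \cup X_2)$. Hence, (a) holds true.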










Exercise 17.10.5 The transition function $f$ is a contraction. Then $\lim_{n \to \infty} h_n$ is the fixed point of $f$, i.e., $\tanh(\theta c) = c$. It follows that $c = 0$.

Exercise 17.10.6 Since $f'(x; \theta) = \theta\sigma'(\theta x) = \theta\sigma(\theta x)(1 - \sigma(\theta x)) \le \theta/4 < 1$, the transition function $f$ is still a contraction. The fixed point satisfies $\sigma(\theta c) = c$. It is unique but cannot be determined in closed form.

Exercise 17.10.7 If $\theta$ is large the graphs $y = \sin(\theta x)$ and $y = x$ intersect more than once, each intersection representing a fixed point. Depending on $h_0$, the system might settle to one of those points.
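The fixed points in Exercises 17.10.5 and 17.10.6 can be approximated by simply iterating the transition function. The value $\theta = 0.5$, the starting state, and the step count below are arbitrary choices for this sketch, picked so that both maps are contractions.

```python
import math

def iterate(f, h0, steps=200):
    """Apply h_{n+1} = f(h_n) repeatedly and return the final state."""
    h = h0
    for _ in range(steps):
        h = f(h)
    return h

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.5  # assumed value with |theta| < 1, so h -> tanh(theta * h) is a contraction

tanh_limit = iterate(lambda h: math.tanh(theta * h), h0=2.0)
sigmoid_limit = iterate(lambda h: sigmoid(theta * h), h0=2.0)

print(tanh_limit)                                         # approaches the fixed point c = 0
print(sigmoid_limit, sigmoid(theta * sigmoid_limit))      # both values agree at the fixed point
```

For the logistic case the limit has no closed form, as noted in the solution, but the iteration converges quickly because the contraction constant is at most $\theta/4$.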


Chapter 18 

Exercise 18.11.1 (a) The rectangle rule is easily satisfied, since if three corners of a rectangle have rational coordinates, the fourth corner has the same property.

( b ) No, it’s not. Since (l/\/2,1 / a/2) ^ (h x h) H(Qx Q), S is not reflexive. 

Exercise 18.11.2 By contradiction, assume there are two distinct $(k - 1)$-hyperplanes containing the points. Then the points belong to the intersection of the hyperplanes, which will be a hyperplane of dimension strictly less than $k - 1$, which is in contradiction with the fact that the points are in a general position. Hence, there is a unique $(k - 1)$-hyperplane containing the points.

By contradiction, assume the points are not in a general position, so they belong to a hyperplane of dimension $p$, which is strictly smaller than $k - 1$. This $p$-hyperplane is always inside the intersection of two distinct $(k - 1)$-hyperplanes. Then the points belong to two distinct $(k - 1)$-hyperplanes, contradiction. It follows that the points are in a general position.


Exercise 18.11.3 By contradiction, assume the vectors are not linearly independent. If we consider the span $\mathcal{S} = span\{\overrightarrow{P_1P_2}, \dots, \overrightarrow{P_1P_k}\}$, then $\dim \mathcal{S} = s < k - 1$. Then the points $P_i$ belong to a hyperplane of dimension $s$, which contradicts the fact that they are in a general position.


Exercise 18.11.4 Conditions $f(x_1) = 1$ and $f(x_2) = 2$ can be written as the linear system

$$ax_1 + b = 1, \qquad ax_2 + b = 2.$$

As long as $x_1 \neq x_2$, the system has a unique solution $(a, b)$.

Exercise 18.11.5 (a) We need to show that for any two points $A, B \in hull(\mathcal{G})$ the convex combination $tA + (1 - t)B \in hull(\mathcal{G})$. We have $A = \sum_i \alpha_iP_i$ and $B = \sum_i \beta_iP_i$, with $\sum_i \alpha_i = \sum_i \beta_i = 1$. Then

$$tA + (1 - t)B = t\sum_i \alpha_iP_i + (1 - t)\sum_i \beta_iP_i = \sum_i \big(t\alpha_i + (1 - t)\beta_i\big)P_i = \sum_i \rho_iP_i.$$

This is a convex combination of $P_i$ because

$$\sum_i \rho_i = t\sum_i \alpha_i + (1 - t)\sum_i \beta_i = t + (1 - t) = 1,$$

and hence $tA + (1 - t)B \in hull(\mathcal{G})$.

(b) By contradiction, we assume there is a convex set $K$ in $\mathbb{R}^n$ such that $\mathcal{G} \subset K \subsetneq hull(\mathcal{G})$. Pick a point $Q \in hull(\mathcal{G}) \setminus K$. Then $Q$ is a convex combination of the points in $\mathcal{G}$, and hence belongs to $K$, contradiction.

(c) Use that the intersection of convex sets is always convex.

(c) Use that intersection of convex sets is always convex. 

Exercise 18.11.6 Let $r(t) = r_0 + tv$ be the line separator of $\mathcal{G}_1$ and $\mathcal{G}_2$. The affine transform $\Phi$ transforms this line into the line $\rho(t) = \Phi(r(t)) = \rho_0 + tu$, with $\rho_0 = Wr_0 + b$ and $u = Wv$. Note that $u \neq 0$ since $\det W \neq 0$. We shall show that $\rho(t)$ is a separator of $\Phi(\mathcal{G}_1)$ and $\Phi(\mathcal{G}_2)$. By contradiction, we assume that $\rho(t)$ does not separate $\Phi(\mathcal{G}_1)$ and $\Phi(\mathcal{G}_2)$. Therefore, there are some $g_i \in \mathcal{G}_i$ such that $\rho(t)$ does not separate $\Phi(g_1)$ and $\Phi(g_2)$. Consequently, the line segment $\Phi(g_1)\Phi(g_2)$ does not intersect the line $\rho(t)$. Since $r(t)$ separates $g_1$ and $g_2$, the line segment $g_1g_2$ intersects the line $r(t)$ at some point $p$. Then $\Phi(p)$ is the intersection between the line segment $\Phi(g_1)\Phi(g_2)$ and the line $\rho(t)$, which is a contradiction.

It is worth noting that the statement still holds as long as the direction vector of the separation line is not in the kernel of the matrix $W$, i.e., $Wv \neq 0$.

Exercise 18.11.7 Let $\mathcal{G} = \{A_1, A_2, \dots, A_m\}$. Then $\overrightarrow{OC} = \frac{1}{m}\sum_{j=1}^m \overrightarrow{OA_j} \in hull(\mathcal{G})$ as a convex combination of position vectors of the cluster elements.


Exercise 18.11.9 (a) Let $p_0 \in A$ and $q_0 \in B$ be such that

$$\|p_0q_0\| = \inf_{p \in A,\, q \in B}\|pq\|.$$

Choose the separation line to be perpendicular to the segment $p_0q_0$ at its midpoint. (b) It follows from part (a).


Chapter 19 

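Exercise 19.7.1 (i) In this case the state updates satisfy the system

$$x_{n+1} = x_n - \eta y_n, \qquad y_{n+1} = y_n + \eta x_n,$$

with $\eta > 0$ learning rate. This can be written in the matrix form

$$\begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & -\eta \\ \eta & 1 \end{pmatrix}\begin{pmatrix} x_n \\ y_n \end{pmatrix}.$$

(ii) Therefore

$$\begin{pmatrix} x_n \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & -\eta \\ \eta & 1 \end{pmatrix}^n\begin{pmatrix} x_0 \\ y_0 \end{pmatrix}.$$

(iii) The matrix $A = \begin{pmatrix} 1 & -\eta \\ \eta & 1 \end{pmatrix}$ has eigenvalues equal to $1 \pm i\eta$ and has the spectral radius $\rho(A) = \sqrt{1 + \eta^2}$ strictly larger than 1. Therefore, the sequence $(x_n, y_n)$ will not converge, unless $x_0 = y_0 = 0$.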
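The divergence can be observed directly by running the update. Because $(x - \eta y)^2 + (y + \eta x)^2 = (1 + \eta^2)(x^2 + y^2)$, each step multiplies the distance to the origin by $\sqrt{1 + \eta^2}$. The learning rate, starting point, and step count below are arbitrary values chosen for this sketch.

```python
import math

def step(x, y, eta):
    """One simultaneous update x <- x - eta*y, y <- y + eta*x."""
    return x - eta * y, y + eta * x

def run(x0, y0, eta, n):
    x, y = x0, y0
    for _ in range(n):
        x, y = step(x, y, eta)
    return x, y

eta, n = 0.1, 50
x, y = run(1.0, 0.0, eta, n)

radius = math.hypot(x, y)
predicted = (1 + eta ** 2) ** (n / 2)  # |(x_0, y_0)| times sqrt(1 + eta^2)^n
print(radius, predicted)
```

The iterates spiral outward for every $\eta > 0$; shrinking the learning rate only slows the growth.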

Exercise 19.7.2 (i) The system is $x'(t) = -a$, $y'(t) = b$, with solution $x(t) = -at + x_0$, $y(t) = bt + y_0$. (ii) The solution is a line in the $(x, y)$-plane which does not approach any equilibrium point as $t \to \infty$.

Exercise 19.7.3 Continuous case: The associated differential system is $x' = -x$, $y' = -y$, with solution $x(t) = x_0e^{-t}$, $y(t) = y_0e^{-t}$. Regardless of the initial value $(x_0, y_0)$, the limit is $\lim_{t \to \infty}(x(t), y(t)) = (0, 0)$. The origin is an equilibrium point. Discrete case: We have $x_{n+1} = x_n - \eta x_n = (1 - \eta)x_n$ and $y_{n+1} = y_n - \eta y_n = (1 - \eta)y_n$, which implies $x_n = (1 - \eta)^nx_0$ and $y_n = (1 - \eta)^ny_0$. For any $x_0$, $y_0$ and $\eta \in (0, 1)$ we obtain $\lim_{n \to \infty}(x_n, y_n) = (0, 0)$.

Exercise 19.7.4 (i) $\mathrm{div}\, U = \partial_xU^1 + \partial_yU^2 = -\partial_x^2V + \partial_y^2V$. (ii) $\mathrm{div}\, U = 0$ if and only if $\partial_x^2V - \partial_y^2V = 0$. With the change of variables $u = x + y$ and $v = x - y$ this becomes $\partial_u\partial_vV = 0$. This has the solution with separate variables $V = F(u) + G(v) = F(x + y) + G(x - y)$, with $F$ and $G$ arbitrary smooth functions. (iii) If the divergence is zero, the flow is incompressible and hence does not converge toward any point in the long range.

Exercise 19.7.5 (i) We have

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} = \Big[1 + \frac{p_{model}(x)}{p_{data}(x)}\Big]^{-1}
= \frac{1}{1 + e^{-\ln\frac{p_{data}(x)}{p_{model}(x)}}} = \sigma\Big(\ln\frac{p_{data}(x)}{p_{model}(x)}\Big) = \sigma(a(x)),$$

from where by using the injectivity of $\sigma$ and then taking an exponential we obtain $e^{a(x)} = \frac{p_{data}(x)}{p_{model}(x)}$. (ii) It follows from the definition of the Kullback-Leibler divergence

$$E_{x \sim p_{data}}[a(x)] = \int p_{data}(x)\ln\frac{p_{data}(x)}{p_{model}(x)}\, dx = D_{KL}(p_{data}\,\|\,p_{model}).$$

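Exercise 19.7.6 We write the condition that both expectations have the critical point at the maximum likelihood estimation

$$\partial_\theta E_{x \sim p_{data}}\big[\ln p_{model}(x; \theta)\big]\Big|_{\theta = \theta_{MLE}} = 0, \qquad
\partial_\theta E_{x \sim p_{model}}\big[f(x)\big]\Big|_{\theta = \theta_{MLE}} = 0.$$

They are equivalent to

$$\int p_{data}(x)\, \partial_\theta \ln p_{model}(x; \theta)\Big|_{\theta = \theta_{MLE}} dx = 0, \qquad
\int f(x)\, \partial_\theta p_{model}(x; \theta)\Big|_{\theta = \theta_{MLE}} dx = 0.$$

One possible function $f(x)$ satisfying this property is obtained equating the integrands. We obtain

$$f(x) = \frac{p_{data}(x)\, \partial_\theta \ln p_{model}(x; \theta)}{\partial_\theta p_{model}(x; \theta)}\Big|_{\theta = \theta_{MLE}} = \frac{p_{data}(x)}{p_{model}(x; \theta_{MLE})}.$$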

Exercise 19.7.7 (i) Since the first expectation term of (19.4.4) is independent of $\theta^{(G)}$, we differentiate just the second term and commute the expectation and the gradient operators.

(ii) We apply the gradient to the expectation $E_{z \sim p_{code}}\big[\ln(1 - D(G(z, \theta^{(G)})))\big]$ and use the chain rule

$$\nabla_{\theta^{(G)}}\ln\big(1 - D(G(z, \theta^{(G)}))\big) = -\frac{1}{1 - D(G(z))}\,\frac{dD(x)}{dx}\Big|_{x = G(z)}\frac{\partial G(z)}{\partial \theta^{(G)}}.$$

(iii) Use that $\frac{dD(x)}{dx} = D(x)(1 - D(x))\frac{da(x)}{dx}$.

Exercise 19.7.8 The equation ^ = ln = 0 yields y = while the 
equation — 0 implies y — y. The equilibrium point is (y*,y*) = 

V 2 5 2 / ’ 

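Exercise 19.7.9 Using the chain rule yields

$$\nabla_{\theta^{(G)}}J^{(G)} = E_{z \sim p_{code}}\big[\nabla_{\theta^{(G)}}\ln D(G(z, \theta^{(G)}))\big]
= E_{z \sim p_{code}}\Big[\frac{1}{D(G(z, \theta^{(G)}))}\,\frac{dD(x)}{dx}\Big|_{x = G(z)}\frac{\partial G}{\partial \theta^{(G)}}\Big].$$

Assuming now that the discriminator is optimized, then by Exercise 19.7.5 we have $D(x) = D^*_G(x) = \sigma(a(x))$, with $a(x) = \ln\frac{p_{data}(x)}{p_{model}(x)}$. Then using the chain rule

$$\frac{dD(x)}{dx} = \sigma'(a(x))\frac{da(x)}{dx} = \sigma(a(x))\big(1 - \sigma(a(x))\big)\frac{da(x)}{dx} = D^*_G(x)\big(1 - D^*_G(x)\big)\frac{da(x)}{dx}.$$

Substituting in the previous expression yields

$$\nabla_{\theta^{(G)}}J^{(G)} = E_{z \sim p_{code}}\Big[\big(1 - D(G(z, \theta^{(G)}))\big)\frac{da(x)}{dx}\frac{\partial G}{\partial \theta^{(G)}}\Big],
\qquad \frac{da(x)}{dx} = \frac{\partial_x p_{data}(x)}{p_{data}(x)} - \frac{\partial_x p_{model}(x)}{p_{model}(x)}.$$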


Chapter 20 

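Exercise 20.10.1 Keeping the contribution of the $k$th neuron separately, we have

$$\begin{aligned}
E(x') - E(x) &= -\frac{1}{2}\sum_{i \neq k}\sum_{j \neq k} w_{ij}x_ix_j - \sum_{i \neq k} b_ix_i - \sum_{i=1}^n w_{ki}x_ix'_k - b_kx'_k \\
&\quad + \frac{1}{2}\sum_{i \neq k}\sum_{j \neq k} w_{ij}x_ix_j + \sum_{i \neq k} b_ix_i + \sum_{i=1}^n w_{ki}x_ix_k + b_kx_k \\
&= \sum_{i=1}^n w_{ki}x_i(x_k - x'_k) + b_k(x_k - x'_k) \\
&= \Big(\sum_{i=1}^n w_{ki}x_i + b_k\Big)(x_k - x'_k).
\end{aligned}$$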


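Exercise 20.10.2 By identification of the first 7 components, we have

$$\begin{aligned}
q_1 &= \frac{1}{Z} \\
q_2 &= e^{b_1}q_1 \ \Rightarrow\ b_1 = \ln\frac{q_2}{q_1} \\
q_3 &= e^{b_2}q_1 \ \Rightarrow\ b_2 = \ln\frac{q_3}{q_1} \\
q_4 &= e^{b_3}q_1 \ \Rightarrow\ b_3 = \ln\frac{q_4}{q_1} \\
q_5 &= e^{w_{23} + b_2 + b_3}q_1 \ \Rightarrow\ w_{23} = \ln\frac{q_5q_1}{q_3q_4} \\
q_6 &= e^{w_{13} + b_1 + b_3}q_1 \ \Rightarrow\ w_{13} = \ln\frac{q_6q_1}{q_2q_4} \\
q_7 &= e^{w_{12} + b_1 + b_2}q_1 \ \Rightarrow\ w_{12} = \ln\frac{q_7q_1}{q_2q_3}.
\end{aligned}$$

We identify the last component and substitute the values obtained before

$$q_8 = e^{w_{12}}e^{w_{13}}e^{w_{23}}e^{b_1}e^{b_2}e^{b_3}q_1 = \frac{q_5q_6q_7q_1}{q_2q_3q_4}.$$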

















Exercise 20.10.3 (a) It follows from a straightforward computation. For 
instance, for the first component, we have 


N 

E Pl -I Pi 

3 = 1 


1 

~Z 


N 


jpne El/T + y^ J Pije Ej/T } 

j=2 




N 


N 


>-E 


z =2 


1 + e( Ei El )/' E 


-E 1 /T 




-Ej/T 


3 =2 


]_ _|_ g(-^i Ej)/T 


1 

z 


N 


__ J e ~Ei/T _ 


N 1 

E e £i/r + e £i/r 1 Z_^ e ^/T + gEi/T 


+ E 


z 


e -£i/T 


(b) In the next computation we shall use that all entries of $P_T$ are nonnegative and that $\sum_i p_{ij} = 1$ for each $j$. Using the $L^1$ norm $\|v\|_1 = \sum_j |v_j|$ we have

$$\begin{aligned}
\|P_Tv\|_1 &= \Big|\sum_j p_{1j}v_j\Big| + \cdots + \Big|\sum_j p_{nj}v_j\Big|
\le \sum_j p_{1j}|v_j| + \cdots + \sum_j p_{nj}|v_j| \\
&= \sum_j \Big(\sum_i p_{ij}\Big)|v_j| = \sum_j |v_j| = \|v\|_1.
\end{aligned}$$

If $\lambda$ is an eigenvalue corresponding to the eigenvector $v$, then $P_Tv = \lambda v$ and taking the norm yields $\|P_Tv\|_1 = |\lambda|\,\|v\|_1$. Then using the previous computation we obtain $|\lambda| = \frac{\|P_Tv\|_1}{\|v\|_1} \le 1$.


(c) First, we show that $q_n$ is convergent. Iterating, we obtain $q_n = P_T^nq_0$. Since $P_T^n = (MDM^{-1})^n = MD^nM^{-1}$, then $q_n = MD^nM^{-1}q_0$. Since the diagonal matrix $D$ contains only entries with absolute value less than or equal to 1, the limit $L = \lim_{n \to \infty}D^n$ is a sparse matrix having only one entry equal to 1. Then the following limit exists:

$$q^* = \lim_{n \to \infty}q_n = \lim_{n \to \infty}MD^nM^{-1}q_0 = MLM^{-1}q_0.$$

On the other side, we apply the limit in the relation $q_{n+1} = P_Tq_n$ and obtain $q^* = P_Tq^*$. This implies that $q^*$ and $p$ are proportional (we used, without proof, that the dimension of the eigenspace with eigenvalue 1 is equal to 1). Since the sum of their elements is 1, it follows that $q^* = p$. The physical



























Hints and Solutions 


685 


significance is that for any initial state, in the long run, the Boltzmann 
machine settles to an equilibrium distribution. 

Exercise 20.10.4 The Boltzmann distribution is $p = \frac{1}{Z}(1, e^{b_1}, e^{b_2}, e^{b_1+b_2+w})$. The Fisher metric involves computations of the type

$$E_p[x_1 x_2] = p(0,0) \cdot 0 \cdot 0 + p(1,0) \cdot 1 \cdot 0 + p(0,1) \cdot 0 \cdot 1 + p(1,1) \cdot 1 \cdot 1 = p(1,1) = \frac{1}{Z} e^{b_1+b_2+w}.$$

The others can be easily computed by the reader.

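Exercise 20.10.5 (a) From $\partial_{w_{ij}} \ln p(x) = \frac{1}{p(x)} \partial_{w_{ij}} p(x)$ and $\partial_{w_{ij}} \ln p(x) = x_i x_j - E_p[x_i x_j]$ it follows that

$$\frac{\partial}{\partial w_{ij}} p(x) = \big(x_i x_j - E_p[x_i x_j]\big) p(x).$$

Then

$$\sum_{i,j} w_{ij} \frac{\partial}{\partial w_{ij}} p(x) = \Big(\sum_{i,j} w_{ij} x_i x_j - \sum_{i,j} E_p[w_{ij} x_i x_j]\Big) p(x) = \big(x^T w x - E_p[x^T w x]\big) p(x).$$

Similarly,

$$\sum_k b_k \frac{\partial}{\partial b_k} p(x) = \big(x^T b - E_p[x^T b]\big) p(x)$$

and hence

$$A_{w,b}\, p(x) = \big(\tfrac{1}{2} x^T w x + x^T b\big) p(x) - E_p\big[\tfrac{1}{2} x^T w x + x^T b\big] p(x) = \big(E_p[E(x)] - E(x)\big) p(x).$$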

(b) Since $w'_{ij}(t) = \alpha w_{ij}(t)$ and $b'_k(t) = \alpha b_k(t)$, we have

$$\partial_t E(x) = -\partial_t\big[\tfrac{1}{2} x^T w(t) x + x^T b(t)\big] = -\big[\tfrac{1}{2} x^T w'(t) x + x^T b'(t)\big] = -\alpha\big[\tfrac{1}{2} x^T w(t) x + x^T b(t)\big] = \alpha E(x).$$

Differentiating in $\ln p_t(x) = -E(x) - \ln Z(t)$, we get

$$\partial_t \ln p_t(x) = -\partial_t E(x) - \frac{1}{Z}\partial_t Z = -\alpha E(x) - \frac{1}{Z(t)} \sum_x \partial_t e^{-E(x)} = -\alpha E(x) + \sum_x p(x)\, \partial_t E(x) = -\alpha E(x) + \alpha \sum_x p(x) E(x) = \alpha\big(E_p[E(x)] - E(x)\big).$$

Using $\partial_t p_t(x) = \partial_t \ln p_t(x)\, p_t(x)$, we obtain the desired formula.

(c) From (a) and (b) it follows that $w_{ij}(t) = w_{ij} e^{\alpha t}$, $b_k(t) = b_k e^{\alpha t}$ is a solution of the problem. The (local) uniqueness follows from the solution formula $p_t(x) = e^{t A_{w,b}} p(x)$.


Exercise 20.10.6 Since rooks attack only horizontally and vertically, there must be exactly one rook on each row and on each column. We start by placing a rook in the first row. There are 8 possibilities to do that. In the next row we place another rook. Since it cannot be in the same column as the first rook, there are 7 possibilities left. On the next row there are 6 possible places, and so on, until in the last row there is only one possible position for the last rook. The total number of possibilities is the product $8 \cdot 7 \cdots 2 \cdot 1$. Hence, there are $8!$ possible ways to place the rooks without them threatening one another. These are the favorable choices. The number of all possible choices is the number of ways in which 8 squares can be selected out of 64, that is, the binomial coefficient $\binom{64}{8}$. To get the desired probability, we divide the number of favorable choices by the number of all possible ways to place the rooks, $p = 8! / \binom{64}{8}$.
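The count above is easy to evaluate. The following short sketch uses only the Python standard library; the variable names are illustrative.

```python
import math

# Favorable placements: one rook per row, with the columns forming
# a permutation of the 8 columns.
favorable = math.factorial(8)

# All placements: any 8 of the 64 squares.
total = math.comb(64, 8)

probability = favorable / total
print(favorable, total, probability)
```

The probability is of the order of $10^{-5}$, so a random placement of 8 rooks is almost never non-attacking.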


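Exercise 20.10.7 (a) Using the conditional probability formula, we have

$$p(h|v) = \frac{p(v,h)}{p(v)} = \frac{p(v,h)}{\sum_h p(v,h)} = \frac{\frac{1}{Z} e^{-E(v,h)}}{\frac{1}{Z}\sum_h e^{-E(v,h)}} = \frac{e^{-E(v,h)}}{\sum_h e^{-E(v,h)}}$$

$$= \frac{e^{v^T w h} e^{b^T v} e^{c^T h}}{\sum_h e^{v^T w h} e^{b^T v} e^{c^T h}} = \frac{e^{v^T w h} e^{c^T h}}{\sum_h e^{v^T w h} e^{c^T h}} = \frac{1}{Z'} e^{v^T w h} e^{c^T h}$$

$$= \frac{1}{Z'} e^{\sum_j \sum_i v_i w_{ij} h_j} e^{\sum_j c_j h_j} = \frac{1}{Z'} \prod_j e^{\sum_i v_i w_{ij} h_j + c_j h_j} = \prod_j \frac{1}{Z'_j} e^{(\sum_i v_i w_{ij} + c_j) h_j} = \prod_j p(h_j|v),$$

with the partition function $Z'_j = e^{(\sum_i v_i w_{ij} + c_j) \cdot 0} + e^{(\sum_i v_i w_{ij} + c_j) \cdot 1} = 1 + e^{\sum_i v_i w_{ij} + c_j}$. Since $p(h|v) = \prod_j p(h_j|v)$, it follows that the $h_j$ are conditionally independent, given the visible states.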

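(b) Using the previous formula, we have

$$p(h_j|v) = \frac{1}{Z'_j} e^{(\sum_i v_i w_{ij} + c_j) h_j} = \frac{e^{(\sum_i v_i w_{ij} + c_j) h_j}}{1 + e^{\sum_i v_i w_{ij} + c_j}},$$

which implies

$$p(h_j = 1|v) = \frac{e^{\sum_i v_i w_{ij} + c_j}}{1 + e^{\sum_i v_i w_{ij} + c_j}} = \frac{1}{1 + e^{-(\sum_i v_i w_{ij} + c_j)}} = \sigma\Big(\sum_i v_i w_{ij} + c_j\Big)$$

$$p(h_j = 0|v) = 1 - \sigma\Big(\sum_i v_i w_{ij} + c_j\Big) = \sigma\Big(-\sum_i v_i w_{ij} - c_j\Big).$$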
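The factorization in parts (a) and (b) can be checked by brute force on a small example. The sketch below assumes binary hidden units and the energy $E(v,h) = -(v^T w h + b^T v + c^T h)$ used in the computation above; the weights, the visible vector, and the function names are hypothetical values chosen only for illustration. The term $b^T v$ is omitted because it cancels in the conditional probability.

```python
import math
from itertools import product

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def conditional_by_enumeration(v, w, c):
    """p(h | v) for every h, by normalizing exp(v^T w h + c^T h) over all h."""
    n_hidden = len(c)
    scores = {}
    for h in product((0, 1), repeat=n_hidden):
        s = sum(v[i] * w[i][j] * h[j]
                for i in range(len(v)) for j in range(n_hidden))
        s += sum(c[j] * h[j] for j in range(n_hidden))
        scores[h] = math.exp(s)
    z = sum(scores.values())  # the partition function Z'
    return {h: val / z for h, val in scores.items()}

def conditional_by_product(v, w, c, h):
    """p(h | v) as a product of sigmoid factors, one per hidden unit."""
    prob = 1.0
    for j in range(len(c)):
        activation = sum(v[i] * w[i][j] for i in range(len(v))) + c[j]
        p_on = sigmoid(activation)
        prob *= p_on if h[j] == 1 else 1.0 - p_on
    return prob

v = (1, 0, 1)                                  # hypothetical visible state
w = [[0.5, -1.0], [0.3, 0.8], [-0.7, 0.2]]     # hypothetical 3 x 2 weights
c = [0.1, -0.4]                                # hypothetical hidden biases
exact = conditional_by_enumeration(v, w, c)
max_gap = max(abs(exact[h] - conditional_by_product(v, w, c, h)) for h in exact)
print(max_gap)
```

Both computations agree up to rounding, which is the conditional independence stated in part (a).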


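(c) The log-likelihood function and its partial derivatives are given by

$$\ell(h|v) = \ln p(h|v) = \ln \frac{e^{v^T w h + c^T h}}{Z'} = v^T w h + c^T h - \ln Z'$$

$$\partial_{b_k} \ell(h|v) = 0$$

$$\partial_{c_k} \ell(h|v) = h_k - \frac{1}{Z'} \partial_{c_k} Z' = h_k - \frac{1}{Z'} \partial_{c_k}\Big(\sum_h e^{v^T w h + c^T h}\Big) = h_k - \frac{1}{Z'} \sum_h e^{v^T w h + c^T h} h_k = h_k - \sum_h p(h|v) h_k = h_k - E^{p_{h|v}}[h_k]$$

$$\partial_{w_{ij}} \ell(h|v) = v_i h_j + v_j h_i - \frac{1}{Z'} \partial_{w_{ij}} Z' = v_i h_j + v_j h_i - \frac{1}{Z'} \sum_h e^{v^T w h + c^T h} (v_i h_j + v_j h_i) = v_i h_j + v_j h_i - E^{p_{h|v}}[v_i h_j + v_j h_i].$$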

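(d) The following entries of the Fisher matrix are zero:

$$g_{b_i b_j} = 0, \quad g_{b_i c_j} = 0, \quad g_{b_i w_{ij}} = 0.$$

The others are given by conditional correlations of the hidden units:

$$g_{c_i c_j} = \mathrm{Cor}(h_i, h_j | v)$$
$$g_{w_{ij} w_{kl}} = \mathrm{Cor}(v_i h_j + v_j h_i, \; v_k h_l + v_l h_k \,|\, v).$$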

(e) Minimizing the Kullback-Leibler divergence $D_{KL}(q(h|v) \,\|\, p(h|v))$ is equivalent to maximizing the cost function $C = \sum_h q(h|v) \ln p(h|v)$. By a computation similar to the one for Boltzmann learning we obtain

$$\Delta w_{ij} = \eta\big(E^{q_{h|v}}[h_i h_j] - E^{p_{h|v}}[h_i h_j]\big)$$
$$\Delta c_k = \eta\big(E^{q_{h|v}}[h_k] - E^{p_{h|v}}[h_k]\big)$$
$$\Delta b = 0.$$



Appendix 


In order to keep this book as self-contained as possible, we have included in 
this appendix a brief presentation of notions that have been used throughout 
the book. They are basic notions of Measure Theory, Probability Theory, 
Linear Algebra, and Functional Analysis. 

However, this appendix is not designed to be a substitute for a complete 
course in the aforementioned areas; it is supposed just to supply enough 
information for the reader in order to be able to follow the book smoothly. 


© Springer Nature Switzerland AG 2020
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3


Appendix A 

Set Theory 


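Let $(A_i)_{i \in I}$ be a family of sets. Their union and intersection will be denoted, respectively, by

$$\bigcup_{i \in I} A_i = \{x; \exists i \in I, x \in A_i\}, \qquad \bigcap_{i \in I} A_i = \{x; \forall i \in I, x \in A_i\}.$$

If $(A_i)_{i \in I}$ and $(B_j)_{j \in J}$ are two families of sets, then:

(a) $\big(\bigcup_i A_i\big) \cap \big(\bigcup_j B_j\big) = \bigcup_{i,j} (A_i \cap B_j)$

(b) $\big(\bigcap_i A_i\big) \cup \big(\bigcap_j B_j\big) = \bigcap_{i,j} (A_i \cup B_j)$

These relations can be generalized to any number of families of sets, $(A^1_{i_1})_{i_1 \in I_1}, (A^2_{i_2})_{i_2 \in I_2}, \dots, (A^p_{i_p})_{i_p \in I_p}$:

(a') $\bigcap_{r=1}^{p} \big(\bigcup_{i_r} A^r_{i_r}\big) = \bigcup_{i_1, \dots, i_p} \big(\bigcap_{r=1}^{p} A^r_{i_r}\big)$

(b') $\bigcup_{r=1}^{p} \big(\bigcap_{i_r} A^r_{i_r}\big) = \bigcap_{i_1, \dots, i_p} \big(\bigcup_{r=1}^{p} A^r_{i_r}\big)$.

The following properties of functions are useful when dealing with sigma-fields. They can be proved by double inclusion. In the following $X$ and $Y$ denote two sets.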

Proposition A.0.1 Let $f : X \to Y$ be a function. Then:

(a) $f\big(\bigcup_i A_i\big) = \bigcup_i f(A_i)$, $\forall A_i \subset X$;

(b) $f\big(\bigcap_i A_i\big) \subset \bigcap_i f(A_i)$, $\forall A_i \subset X$, with equality if $f$ is injective;

(c) $f^{-1}\big(\bigcup_i B_i\big) = \bigcup_i f^{-1}(B_i)$, $\forall B_i \subset Y$;

(d) $f^{-1}\big(\bigcap_i B_i\big) = \bigcap_i f^{-1}(B_i)$, $\forall B_i \subset Y$;

(e) $f^{-1}(B^c) = \big(f^{-1}(B)\big)^c$, $\forall B \subset Y$.
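These relations can be tested on finite sets. The following is a small sketch with a hypothetical non-injective map, $f(x) = x^2$; it illustrates that preimages commute with unions, intersections, and complements, while the image of an intersection can be strictly smaller than the intersection of the images.

```python
def image(f, subset):
    """Direct image f(A)."""
    return {f(x) for x in subset}

def preimage(f, domain, subset):
    """Inverse image f^{-1}(B), computed inside the given domain."""
    return {x for x in domain if f(x) in subset}

X = {-2, -1, 0, 1, 2}
Y = {0, 1, 4, 9}
f = lambda x: x * x          # hypothetical map X -> Y, not injective

A1, A2 = {-2, -1}, {1, 2}
B1, B2 = {0, 1}, {1, 4}

union_ok = image(f, A1 | A2) == image(f, A1) | image(f, A2)
inter_lhs = image(f, A1 & A2)              # image of the empty set
inter_rhs = image(f, A1) & image(f, A2)    # both images equal {1, 4}
pre_union_ok = preimage(f, X, B1 | B2) == preimage(f, X, B1) | preimage(f, X, B2)
pre_inter_ok = preimage(f, X, B1 & B2) == preimage(f, X, B1) & preimage(f, X, B2)
pre_compl_ok = preimage(f, X, Y - B1) == X - preimage(f, X, B1)

print(union_ok, inter_lhs, inter_rhs, pre_union_ok, pre_inter_ok, pre_compl_ok)
```

Here the sets $A_1$ and $A_2$ are disjoint, yet their images coincide, so only the inclusion holds for images of intersections.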

A set $A \subset \mathbb{R}^n$ is called bounded if it can be included into a ball, i.e., there is an $r > 0$ such that $A \subset B(0, r)$. Equivalently, $\|x\| < r$, for all $x \in A$.

A set $A \subset \mathbb{R}^n$ is called closed if $A$ contains the limit of any convergent sequence $(x_n) \subset A$, i.e., $\lim_{n \to \infty} x_n \in A$.

A subset $K$ of $\mathbb{R}^n$ is called compact if it is bounded and closed. Equivalently, $K$ is compact if from any sequence $(x_n) \subset K$ we can extract a subsequence $(x_{n_k})_k$ that converges to a point of $K$. The prototype examples for a compact set in $\mathbb{R}^n$ in this book are the hypercube, $K = I_n = [0,1] \times \cdots \times [0,1]$, and the $n$-dimensional closed ball, $K = B(x_0, r) = \{x; \|x - x_0\| \le r\}$.

Proposition A.0.2 (Cantor's lemma) An intersection of a descending sequence of nonempty compact sets in $\mathbb{R}^n$ is nonempty.

For example, if $K_n = [-1/n, 1/n]$, then $K_{n+1} \subset K_n \subset \mathbb{R}$ and $\bigcap_{n \ge 1} K_n = \{0\}$.

A binary relation "$\le$" on a nonempty set $A$ is called an order relation if for any $a, b, c \in A$ we have

(i) $a \le a$ (reflexivity);

(ii) $a \le b$ and $b \le a$ then $a = b$ (antisymmetry);

(iii) $a \le b$ and $b \le c$ then $a \le c$ (transitivity).

A set $A$ on which an order relation "$\le$" has been defined is called an ordered set, and is denoted by $(A, \le)$. If for any two elements $a, b \in A$ we have either $a \le b$ or $b \le a$, then $(A, \le)$ is called a totally ordered set. An element $m \in A$ is called maximal if $m \le x$ implies $m = x$. Consider a subset $B \subset A$. An element $a \in A$ is called an upper bound for $B$ if $x \le a$ for all $x \in B$. An ordered set $(A, \le)$ is called inductive if any totally ordered subset $B \subset A$ has an upper bound.

Lemma A.0.3 (Zorn) Any nonempty ordered set which is inductive has at least one maximal element.



Appendix B 

Tensors 


Let $I_1, \dots, I_n \subset \mathbb{N}$ be $n$ subsets of the set of natural numbers, $\mathbb{N}$, and consider the Cartesian product

$$I_1 \times \cdots \times I_n = \{(i_1, \dots, i_n); \; i_k \in I_k, \; 1 \le k \le n\}.$$

A set of objects indexed over the multi-index $(i_1, \dots, i_n) \in I_1 \times \cdots \times I_n$ is a tensor of order $n$. Typically, we denote it by a symbol carrying these indices, for example $T_{i_1, \dots, i_n}$.

It is worth noting, without getting into details, that the tensor concept comes from Differential Geometry and Relativity Theory, where it was successfully used to describe the curvature of manifolds, tangent vector fields, or certain physical measures, such as velocity, energy-momentum, force, density of mass, etc.

Many objects in neural networks, such as inputs, weights, biases, intermediate representations and outputs, are described by tensors. For instance, a vector $x \in \mathbb{R}^d$, given by $x = (x_1, \dots, x_d)$, is a tensor of order 1 and type $d$. A matrix, $A = (A_{ij}) \in \mathbb{R}^{d \times r}$, is a tensor of order 2 and type $d \times r$ ($d$ rows and $r$ columns), see Fig. 1 a. A tensor of order 3, say $t \in \mathbb{R}^{d \times r \times s}$, can be viewed as a vector of length $s$ of matrices of type $d \times r$. A generic entry of this tensor is represented by $t_{ijk}$, see Fig. 1 b. A scalar value can be seen as a tensor of order zero.

Example B.0.1 A color image can be represented as a tensor of order 3 
of type n x m x 3, where n is the number of pixel lines, m is the number 
of columns and 3 stands for the number of the color channels in the RGB 
format. 
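The order and type of a tensor can be read off directly from a nested-list representation. The following is a minimal sketch in plain Python; the helper `shape` and the tiny $2 \times 4$ image are hypothetical and serve only to illustrate the example above.

```python
def shape(t):
    """Return the type (dimensions) of a nested-list tensor; a scalar has shape ()."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]  # assumes a rectangular, non-empty nesting
    return tuple(dims)

n, m = 2, 4  # hypothetical number of pixel lines and columns
# An n x m x 3 tensor: one [R, G, B] triple per pixel, initially black.
image = [[[0, 0, 0] for _ in range(m)] for _ in range(n)]
image[1][3] = [255, 128, 0]  # set the pixel on line 1, column 3

order = len(shape(image))    # number of indices needed to address an entry
red_value = image[1][3][0]   # generic entry t_ijk with i=1, j=3, k=0
print(shape(image), order, red_value)
```

A vector such as `[1, 2, 3]` has shape `(3,)` and order 1, while a plain number has the empty shape and order 0, in agreement with the conventions above.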





Figure 1: a. An order 2 tensor $A \in \mathbb{R}^{5 \times 4}$. b. An order 3 tensor $t \in \mathbb{R}^{7 \times 4 \times 3}$.





























Appendix C 

Measure Theory 


The reader interested in a more detailed exposition of Measure Theory should
consult reference [51].

C.1 Information and σ-algebras

The concept of σ-algebra is useful to describe an information structure and to define later measurable functions and measures. To make the understanding easier, the concept of σ-algebra will be introduced as the information stored in a set of mental concepts.

We assume the brain is a set of $N$ neurons, each neuron being regarded as a device that can be either on or off. This leads to a total of $2^N$ possible brain states. Any subset of this set corresponds to a representation of a mental concept. The set of all possible mental concepts will define a σ-algebra as in the following.

Assume that one looks at an object. Then one's mind takes a certain state associated with that specific object. For instance, looking at an “apple” and at a “bottle”, separately, the mind will produce two mental concepts denoted by $A$ and $B$, respectively. Then, our day-to-day experience says that the mind can understand both “apple and bottle” as a mental concept represented by the intersection $A \cap B$, which contains the common features of these two objects, such as color, shape, size, etc.

The mind can also understand the compound mental concept of “apple or bottle” as the union concept $A \cup B$, which contains all features of either the apple or the bottle. The mind can understand that the apple is not a bottle, and in general, if an apple is presented, then it understands all objects that are not an apple. This is done by a concept denoted by $\bar{A}$, called the complement of the set $A$.

The mental concepts $A$ and $B$ denote two pieces of information the mind gets from the outside world. Then it has the ability to compile them into new pieces of information, such as $A \cap B$, $A \cup B$, $\bar{A}$ and $\bar{B}$.

If all possible mental concepts are denoted by $\mathcal{E}$, then the previous statements can be written as

(i) $A \cup B \in \mathcal{E}$, $\forall A, B \in \mathcal{E}$

(ii) $A \cap B \in \mathcal{E}$, $\forall A, B \in \mathcal{E}$

(iii) $\bar{A} \in \mathcal{E}$, $\forall A \in \mathcal{E}$.

Relation (i) can be generalized to $n$ sets as in the following:

(iv) For any $A_1, \dots, A_n \in \mathcal{E}$, then $\bigcup_{i=1}^{n} A_i \in \mathcal{E}$.

The structure of $\mathcal{E}$ defined by (i)-(iv) is called an algebra structure. We note that (ii) is a consequence of (i) and (iii), a fact that follows from de Morgan's relation $A \cap B = \overline{\bar{A} \cup \bar{B}}$.

Assume now the mind has the capability of picking up an infinite countable sequence of pieces of information and storing it also as information, i.e., $\mathcal{E}$ is closed with respect to countable unions:

(v) For any $A_1, \dots, A_n, \dots \in \mathcal{E}$, then $\bigcup_{i \ge 1} A_i \in \mathcal{E}$.

In the case when $\mathcal{E}$ satisfies also condition (v), the structure is called a σ-algebra. This will be the fundamental structure for modeling information.

Remark C.1.1 Using de Morgan's relation, it follows that a σ-algebra is also closed under countable intersections, i.e., $\bigcap_{i \ge 1} A_i \in \mathcal{E}$ for any $A_1, \dots, A_n, \dots \in \mathcal{E}$.

Example C.1.1 Let $E$ be a finite set. The smallest σ-algebra on $E$ is $\mathcal{E} = \{\emptyset, E\}$, while the largest is the set of parts $\mathcal{E} = 2^E = \{P; P \subset E\}$.

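Example C.1.2 Let $\mathcal{C}$ be a set of parts of $E$, i.e., $\mathcal{C} \subset 2^E$. Then the smallest σ-algebra on $E$ that contains $\mathcal{C}$ is given by the intersection of all σ-algebras $\mathcal{E}_\alpha$ that contain $\mathcal{C}$:

$$\sigma(\mathcal{C}) = \bigcap_\alpha \mathcal{E}_\alpha.$$

Here, we note that an intersection of σ-algebras is also a σ-algebra. $\sigma(\mathcal{C})$ is the information structure generated by the collection of sets $\mathcal{C}$. It can be shown that it has the following properties:

(i) $\mathcal{C} \subset \mathcal{D} \Rightarrow \sigma(\mathcal{C}) \subset \sigma(\mathcal{D})$;

(ii) $\mathcal{C} \subset \sigma(\mathcal{D}) \Rightarrow \sigma(\mathcal{C}) \subset \sigma(\mathcal{D})$;

(iii) If $\mathcal{C} \subset \sigma(\mathcal{D})$ and $\mathcal{D} \subset \sigma(\mathcal{C})$, then $\sigma(\mathcal{C}) = \sigma(\mathcal{D})$;

(iv) $\mathcal{C} \subset \mathcal{D} \subset \sigma(\mathcal{C}) \Rightarrow \sigma(\mathcal{C}) = \sigma(\mathcal{D})$.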
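On a finite set the generated σ-algebra can be computed explicitly, since closure under complements and finite unions is then enough. The following sketch is one possible way to do it, not the construction used in the text (which takes an intersection of σ-algebras); the function name and the sample sets are hypothetical.

```python
from itertools import combinations

def generate_sigma_algebra(universe, collection):
    """Smallest family containing `collection` that is closed under
    complements and unions; on a finite universe this is sigma(collection)."""
    universe = frozenset(universe)
    family = {frozenset(), universe} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            complement = universe - a
            if complement not in family:
                family.add(complement)
                changed = True
        for a, b in combinations(current, 2):
            union = a | b
            if union not in family:
                family.add(union)
                changed = True
    return family

E = {1, 2, 3, 4}
sigma_small = generate_sigma_algebra(E, [{1}])
sigma_large = generate_sigma_algebra(E, [{1}, {1, 2}])
print(sorted(sorted(s) for s in sigma_large))
```

In this example the atoms of the larger structure are {1}, {2} and {3, 4}, so it contains $2^3 = 8$ sets, and the monotonicity property (i) can be observed as `sigma_small <= sigma_large`.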





Example C.1.3 Let $E$ be a topological space. The Borel σ-algebra $\mathcal{B}_E$ is the σ-algebra generated by all open sets of $E$. In particular, if $E = \mathbb{R}^n$, then $\mathcal{B}_{\mathbb{R}^n}$ is the σ-algebra generated by all open balls of $\mathbb{R}^n$.

Example C.1.4 The Borel σ-algebra $\mathcal{B}_{\mathbb{R}}$ is the σ-algebra generated by each of the following collections: $\{(-\infty, x); x \in \mathbb{R}\}$, $\{(-\infty, x]; x \in \mathbb{R}\}$, $\{(x, y); x, y \in \mathbb{R}\}$, $\{(x, y]; x, y \in \mathbb{R}\}$, $\{(x, \infty); x \in \mathbb{R}\}$.

Definition C.1.2 A collection $\mathcal{P}$ of subsets of $E$ is called a p-system if it is closed under intersections, i.e.,

$$A, B \in \mathcal{P} \Rightarrow A \cap B \in \mathcal{P}.$$

A collection $\mathcal{D}$ of subsets of $E$ is called a d-system on $E$ if:

(i) $E \in \mathcal{D}$;

(ii) $A, B \in \mathcal{D}$ and $B \subset A \Rightarrow A \setminus B \in \mathcal{D}$;

(iii) $(A_n)_n \subset \mathcal{D}$ and $A_n \nearrow A \Rightarrow A \in \mathcal{D}$.

Theorem C.1.3 (Dynkin) If a d-system contains a p-system, then it contains also the σ-algebra generated by that p-system, i.e.,

$$\mathcal{P} \subset \mathcal{D} \Rightarrow \sigma(\mathcal{P}) \subset \mathcal{D}.$$

A measurable space is a pair $(E, \mathcal{E})$, where $E$ is a set and $\mathcal{E}$ is a set of parts of $E$, which is a σ-algebra.

For instance, the configurations of the brain can be thought of as a measurable space. In this case, $E$ is the set of brain synapses (states of the brain). The brain stores information into configurations (collections of states of the brain). A configuration of the brain is a subset of the synapses set $E$ that gets activated. The set of brain configurations, $\mathcal{E}$, forms a σ-algebra, and hence the pair $(E, \mathcal{E})$ becomes a measurable space.

C.2 Measurable Functions 

The rough idea of measuring seems to be simple: it is a procedure that assigns a number to each set. The set can be, for instance, a set of points on a line or in the plane, or a set of people with a certain characteristic, etc. However, questions like "How many people weigh exactly 200 lb?" are not well posed. This is because in order to measure, we need an interval to which the number should belong. There might be many people whose weights are between 199.9 lb and 200.01 lb, and who are regarded as weighing just 200 lb.

We might face a similar problem when writing a computer program. Let $L$ and $D$ denote the length and the diameter of a circle. The program:

    from math import pi
    x = L / D
    if x == pi:
        print("it is a circle")

will not work as intended, because the computed quotient will almost never be exactly equal to $\pi$. The correct version should include an error tolerance:

    from math import pi
    x = L / D
    if abs(x - pi) < 0.0001:
        print("it is probably a circle")

Therefore, the correct way to measure is to assign a lower and an upper bound for the measured result. Here is where the information defined by open intervals, which is a Borel σ-algebra, will come into play.
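The tolerance test can be packaged as a small function. The sketch below uses `math.isclose` from the Python standard library with an absolute tolerance; the function name `looks_like_circle` and the sample measurements are hypothetical.

```python
import math

def looks_like_circle(length, diameter, tol=1e-4):
    """True if length / diameter lies within `tol` of pi."""
    return math.isclose(length / diameter, math.pi, abs_tol=tol)

exact_match = (31.4159 / 10.0) == math.pi       # exact comparison fails
within_tol = looks_like_circle(31.4159, 10.0)   # ratio differs from pi by about 3e-6
far_off = looks_like_circle(31.0, 10.0)         # ratio differs from pi by about 0.04
print(exact_match, within_tol, far_off)
```

The test `abs(x - pi) < tol` asks whether the measured quotient belongs to the open interval $(\pi - \text{tol}, \pi + \text{tol})$, which is the kind of set that generates the Borel σ-algebra.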

Let $f : E \to \mathbb{R}$ be a function, where $E$ denotes the synapses set of the brain. The function $f$ is called measurable if it regresses any open interval into an a priori given brain configuration. This means $f^{-1}(a, b) \in \mathcal{E}$, for all $a, b \in \mathbb{R}$, i.e., $f^{-1}(a, b)$ is one of the brain configurations, where we used the notation $f^{-1}(a, b) = \{x \in E; f(x) \in (a, b)\}$. We can also write this, equivalently, as $f^{-1}(\mathcal{B}_{\mathbb{R}}) \subset \mathcal{E}$. To indicate explicitly that the measurability is considered with respect to the σ-algebra $\mathcal{B}_{\mathbb{R}}$, the function $f$ is sometimes called Borel-measurable. Measurable functions can be used by the brain to become aware of the exterior world.

If $(E, \mathcal{E})$ and $(F, \mathcal{F})$ are two measurable spaces, a function $f : E \to F$ is called measurable if $f^{-1}(B) \in \mathcal{E}$ for all $B \in \mathcal{F}$. Or, equivalently, $f^{-1}(\mathcal{F}) \subset \mathcal{E}$. If $(E, \mathcal{E})$ is the measurable space associated with Ann's brain and $(F, \mathcal{F})$ is associated with Bob's brain, then the fact that the function $f$ is measurable means that “anything Bob can think, Ann can understand.” More precisely, any configuration state $B$ in Bob's brain can be regressed by $f$ into a configuration present in Ann's brain.
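For finite measurable spaces, measurability can be checked by listing preimages. The following sketch assumes both σ-algebras are given explicitly as sets of frozensets; the spaces, the maps `f` and `g`, and the helper names are hypothetical.

```python
def preimage(f, domain, subset):
    """f^{-1}(B) as a frozenset, so that it can be looked up in a sigma-algebra."""
    return frozenset(x for x in domain if f(x) in subset)

def is_measurable(f, domain, sigma_e, sigma_f):
    """Check that f^{-1}(B) belongs to sigma_e for every B in sigma_f."""
    return all(preimage(f, domain, b) in sigma_e for b in sigma_f)

# "Ann's" space: she can only distinguish {1, 2} from {3, 4}.
E = {1, 2, 3, 4}
sigma_E = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(E)}

# "Bob's" space: two labels, all subsets are configurations.
F = {"low", "high"}
sigma_F = {frozenset(), frozenset({"low"}), frozenset({"high"}), frozenset(F)}

f = lambda x: "low" if x <= 2 else "high"   # respects Ann's partition
g = lambda x: "low" if x == 1 else "high"   # would require isolating {1}

f_measurable = is_measurable(f, E, sigma_E, sigma_F)
g_measurable = is_measurable(g, E, sigma_E, sigma_F)
print(f_measurable, g_measurable)
```

The map `g` fails because the preimage of `{"low"}` is `{1}`, which is not one of the configurations available in $\mathcal{E}$.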


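Example C.2.1 Let $A \in \mathcal{E}$ be a set and consider the indicator function of $A$

$$1_A(x) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A. \end{cases}$$

Then $1_A$ is measurable (with respect to $\mathcal{E}$).

Example C.2.2 A function $f : E \to \mathbb{R}$ is called simple if it is a linear combination of indicator functions

$$f(x) = \sum_{j=1}^{n} a_j 1_{A_j}(x), \qquad a_j \in \mathbb{R}, \; A_j \in \mathcal{E}.$$

Any simple function is measurable.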





Given some measurable functions, one can use them to construct more measurable functions, as in the following:

1. If $f$ and $g$ are measurable, then $f \pm g$, $f \cdot g$, $\min(f, g)$, $\max(f, g)$ are measurable.

2. If $(f_n)_n$ is a sequence of measurable functions, then $\inf f_n$, $\sup f_n$, $\liminf f_n$, $\limsup f_n$, and $\lim f_n$ (if it exists) are all measurable.

3. If $f$ is measurable, then $f^+ = \max(f, 0)$ and $f^- = -\min(f, 0)$ are measurable.

It can be shown that any measurable function is the limit of a sequence of simple functions. If the function is bounded, the same bound applies to the simple functions.

C.3 Measures 

A measure is a way to assess a body of information. This can be done using a mapping $\mu : \mathcal{E} \to \mathbb{R}_+ \cup \{\infty\}$ with the following properties:

(i) $\mu(\emptyset) = 0$;

(ii) $\mu\big(\bigcup_{n \ge 1} A_n\big) = \sum_{n \ge 1} \mu(A_n)$, for any disjoint sets $A_n$ in $\mathcal{E}$ (countable additivity).

If $(E, \mathcal{E})$ is a measurable space associated with the states and configurations of a brain, respectively, then a measure $\mu$ is an evaluation system of beliefs; each brain configuration $A \in \mathcal{E}$ is associated with an intensity $\mu(A)$. The triplet $(E, \mathcal{E}, \mu)$ is called a measure space.

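Example C.3.1 (Dirac measure) Let $x \in E$ be a fixed point. Then

$$\delta_x(A) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A, \end{cases} \qquad \forall A \in \mathcal{E}$$

is a measure on $\mathcal{E}$.

Example C.3.2 (Counting measure) Let $D \subset E$ be fixed and finite. Define

$$\mu(A) = \mathrm{card}(A \cap D) = \sum_{x \in D} \delta_x(A), \qquad \forall A \in \mathcal{E}.$$

$\mu$ is a measure on $\mathcal{E}$. It counts the number of elements of $A$ which are in $D$.

Example C.3.3 (Discrete measure) Let $D \subset E$ be fixed and discrete. Let $m(x)$ be the mass of $x$, with $m(x) \ge 0$, for all $x \in D$. Consider

$$\mu(A) = \mathrm{mass}(A) = \sum_{x \in D} m(x) \delta_x(A), \qquad \forall A \in \mathcal{E}.$$

$\mu$ is a measure on $\mathcal{E}$, which assesses the mass of $A$.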
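The three examples above translate directly into code when sets are finite. The following is an illustrative sketch in which a measure is modeled as a Python function taking a set and returning a number; the constructors `dirac`, `counting`, `discrete` and the sample masses are hypothetical names and values.

```python
def dirac(x):
    """Dirac measure sitting at the point x."""
    return lambda A: 1 if x in A else 0

def counting(D):
    """Counting measure: number of elements of A that lie in D."""
    return lambda A: sum(dirac(x)(A) for x in D)

def discrete(mass):
    """Discrete measure with point masses given by the dict {x: m(x)}."""
    return lambda A: sum(m * dirac(x)(A) for x, m in mass.items())

A, B = {1, 2}, {3, 5}          # two disjoint sets
delta_2 = dirac(2)
count = counting({1, 2, 3})
mu = discrete({1: 0.5, 2: 1.5, 3: 2.0})

print(delta_2(A), count(A), mu(A), mu(B), mu(A | B))
```

Since $A$ and $B$ are disjoint, `mu(A | B)` equals `mu(A) + mu(B)`, which is the finite additivity of the measure, and every one of these measures vanishes on the empty set.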






Example C.3.4 (Lebesgue measure) Consider $E = \mathbb{R}$ and the σ-algebra $\mathcal{E} = \mathcal{B}_{\mathbb{R}}$. $\mu$ is defined for open intervals as their lengths, $\mu(a, b) = |b - a|$. It can be shown that there is a unique measure that extends $\mu$ to $\mathcal{B}_{\mathbb{R}}$, called the Lebesgue measure on $\mathbb{R}$. In a similar way, replacing lengths by volumes and open intervals by open hypercubes, one can define the Lebesgue measure on $\mathbb{R}^n$.

Example C.3.5 (Borel measure) Let $\mathcal{B}_{\mathbb{R}^n}$ be the σ-algebra generated by all the open sets of $\mathbb{R}^n$. A Borel measure is a measure $\mu : \mathcal{B}_{\mathbb{R}^n} \to \mathbb{R}$. In the case $n = 1$, $\mu$ becomes a Borel measure on the real line. If $\mu$ is a finite measure, we associate to it the function $F(x) = \mu(-\infty, x]$, called the cumulative distribution function, which is a monotone increasing function satisfying $\mu(a, b] = F(b) - F(a)$. For instance, the Lebesgue measure is a Borel measure.

Example C.3.6 (Baire measure) Let $K \subset \mathbb{R}^n$ and denote by $C^0(K)$ the set of all continuous real-valued functions with compact support (which vanish outside of a compact subset of $K$). The class of Baire sets, $\mathcal{B}$, is defined to be the σ-algebra generated by $\{x; f(x) \ge a\}$, with $f \in C^0(K)$. A Baire measure is a measure defined on $\mathcal{B}$, such that $\mu(C) < \infty$, for all compact subsets $C \subset K$. It is worth noting that for $K \subset \mathbb{R}^n$ the class of Baire sets is the same as the class of Borel sets. In particular, any finite Borel measure is a Baire measure.

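Proposition C.3.1 (Properties of measures) Let $(E, \mathcal{E}, \mu)$ be a measure space. The following hold:

(i) finite additivity:

$$A \cap B = \emptyset \Rightarrow \mu(A \cup B) = \mu(A) + \mu(B), \qquad \forall A, B \in \mathcal{E};$$

(ii) monotonicity:

$$A \subset B \Rightarrow \mu(A) \le \mu(B), \qquad \forall A, B \in \mathcal{E};$$

(iii) sequential continuity:

$$A_n \nearrow A \Rightarrow \mu(A_n) \nearrow \mu(A), \qquad n \to \infty;$$

(iv) Boole's inequality:

$$\mu\Big(\bigcup_n A_n\Big) \le \sum_n \mu(A_n), \qquad \forall A_n \in \mathcal{E}.$$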





If $\mu$ and $\lambda$ are measures on $(E, \mathcal{E})$, then $\mu + \lambda$, $c\mu$ and $c_1 \mu + c_2 \lambda$ are also measures, where $c, c_i \in \mathbb{R}_+$.

Let $(E, \mathcal{E}, \mu)$ be a measure space. Then $\mu$ is called:

• finite measure if $\mu(E) < \infty$;

• probability measure if $\mu(E) = 1$;

• σ-finite measure if $\mu(E_n) < \infty$, where $(E_n)_n$ is a partition of $E$, with $E_n \in \mathcal{E}$;

• Σ-finite measure if $\mu = \sum_n \mu_n$, with $\mu_n$ finite measures.

For instance, the Lebesgue measure is σ-finite but it is not finite.

A set $M \in \mathcal{E}$ is called negligible if $\mu(M) = 0$. Two measurable functions $f, g : E \to \mathbb{R}$ are equal almost everywhere, i.e., $f = g$ a.e., if there is a negligible set $M$ such that $f(x) = g(x)$ for all $x \in E \setminus M$.

C.4 Integration in Measure

Let $(E, \mathcal{E}, \mu)$ be a measure space and $f : E \to \mathbb{R}$ be a measurable function. The object

$$\mu(f) = \int_E f(x)\, \mu(dx) = \int_E f \, d\mu$$

represents an evaluation of $f$ through the system of beliefs $\mu$ and it is called the integral of $f$ with respect to the measure $\mu$. This is defined by the following sequence of steps:

(i) If $f$ is simple and positive, i.e., if $f = \sum_{i=1}^{n} w_i 1_{A_i}$, then define

$$\mu(f) = \sum_{i=1}^{n} w_i \mu(A_i).$$

(ii) If $f$ is measurable and positive, then there is a sequence $(f_n)_n$ of simple and positive functions with $f_n \nearrow f$. In this case, define $\mu(f) = \lim_n \mu(f_n)$.

(iii) If $f$ is measurable, then let $f = f^+ - f^-$ and define

$$\mu(f) = \mu(f^+) - \mu(f^-).$$

If $\mu(|f|) < \infty$, then the measurable function $f$ is called integrable. The nonnegativity, linearity, and monotonicity properties of the integral are, respectively, given by

1. $\mu(f) \ge 0$ for $f : E \to \mathbb{R}_+$;

2. $\mu(af + bg) = a\mu(f) + b\mu(g)$, for all $a, b \in \mathbb{R}$;

3. If $f \le g$ then $\mu(f) \le \mu(g)$.

Example C.4.1 Let $\delta_x$ be the Dirac measure sitting at $x$. The integral of the measurable function $f$ with respect to the Dirac measure is $\delta_x(f) = f(x)$.





Example C.4.2 The integral of the measurable function $f$ with respect to the discrete measure $\mu = \sum_{x \in D} m(x) \delta_x$ is given by $\mu(f) = \sum_{x \in D} m(x) f(x)$.

Example C.4.3 Let $E = \mathbb{R}$, $\mathcal{E} = \mathcal{B}_{\mathbb{R}}$ and $\mu$ be the Lebesgue measure on $\mathbb{R}$. In this case $\mu(f) = \int_E f(x)\, dx$ is called the Lebesgue integral of $f$ on $E$.

The integral of $f$ over a set $A \in \mathcal{E}$ is defined by

$$\int_A f \, d\mu = \int_E f 1_A \, d\mu = \mu(f 1_A).$$

In particular, we have

$$\int_A d\mu = \mu(1_A) = \mu(A), \qquad \forall A \in \mathcal{E}.$$
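For a discrete measure the integral reduces to a weighted sum, which makes the properties above easy to experiment with. The following is a minimal sketch; the function `integrate`, the masses, and the integrand are hypothetical.

```python
def integrate(f, mass, A=None):
    """mu(f 1_A) = sum over x in D of m(x) f(x) 1_A(x), for mu = sum m(x) delta_x.
    If A is None, the integral is taken over the whole space."""
    return sum(m * f(x) for x, m in mass.items() if A is None or x in A)

mass = {0: 0.25, 1: 0.5, 2: 0.25}    # hypothetical masses m(x) on D = {0, 1, 2}
f = lambda x: x * x

total = integrate(f, mass)                             # 0.25*0 + 0.5*1 + 0.25*4
over_A = integrate(f, mass, A={1, 2})                  # integral of f over A
measure_A = integrate(lambda x: 1, mass, A={1, 2})     # mu(1_A) = mu(A)
dirac_value = integrate(f, {2: 1.0})                   # delta_2(f) = f(2)
linear = integrate(lambda x: 2 * f(x) + 3, mass)       # 2 mu(f) + 3 mu(1)

print(total, over_A, measure_A, dirac_value, linear)
```

The last value equals `2 * total + 3 * integrate(lambda x: 1, mass)`, an instance of the linearity property, and integrating the constant 1 over $A$ returns the measure of $A$.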


We provide next three key tools for interchanging integrals and limits.

Theorem C.4.1 (Monotone Convergence Theorem) Let $(f_n)_n$ be a sequence of positive and measurable functions on $E$ such that $f_n \nearrow f$. Then

$$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu.$$

Theorem C.4.2 (Dominated Convergence Theorem) Let $(f_n)_n$ be a sequence of measurable functions with $|f_n| \le g$, with $g$ integrable on $E$. If $f = \lim_n f_n$ exists, then

$$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu.$$

Theorem C.4.3 (Bounded Convergence Theorem) Let $(f_n)_n$ be a bounded sequence of measurable functions on $E$ with $\mu(E) < \infty$. If $f = \lim_{n \to \infty} f_n$ exists, then

$$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu.$$

We make the remark that two measurable functions equal almost everywhere have equal integrals, i.e., the integral is insensitive to changes over negligible sets.




C.5 Image Measures 

Let $(F, \mathcal{F})$ and $(E, \mathcal{E})$ be two measurable spaces and $h : F \to E$ be a measurable function. Any measure $\nu$ on $(F, \mathcal{F})$ induces a measure $\mu$ on $(E, \mathcal{E})$ by

$$\mu(B) = \nu(h^{-1}(B)), \qquad \forall B \in \mathcal{E}.$$

The measure $\mu = \nu \circ h^{-1}$ is called the image measure of $\nu$ under $h$. It is worth noting that if $f : E \to \mathbb{R}$ is measurable, then the above relation becomes the following change of variable formula:

$$\int_E f(y)\, d\mu(y) = \int_F f(h(x))\, d\nu(x),$$

provided the integrals exist.

Remark C.5.1 If $(F, \mathcal{F})$ and $(E, \mathcal{E})$ represent the pairs (states, configurations) for Ann's and Bob's brains, respectively, then the fact that $h$ is measurable means that “anything Bob can think, Ann can understand.” The measure $\mu$ is a system of beliefs for Bob, which is induced by the system of beliefs $\nu$ of Ann.
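The image measure and the change of variable formula can be illustrated with a discrete measure. In the sketch below a measure is stored as a dictionary of point masses; the function `image_measure`, the map `h`, and the numerical values are hypothetical.

```python
from collections import defaultdict

def image_measure(nu, h):
    """Point masses of mu = nu o h^{-1}, for a discrete nu given as {point: mass}."""
    mu = defaultdict(float)
    for x, m in nu.items():
        mu[h(x)] += m          # all mass at x is transported to h(x)
    return dict(mu)

nu = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}   # hypothetical masses on F
h = lambda x: x * x                                # map from F to E
mu = image_measure(nu, h)

f = lambda y: y + 1
lhs = sum(m * f(y) for y, m in mu.items())         # integral of f with respect to mu
rhs = sum(m * f(h(x)) for x, m in nu.items())      # integral of f(h(x)) with respect to nu
print(mu, lhs, rhs)
```

Both sums agree up to rounding, which is the discrete form of the change of variable formula. The points $-1$ and $1$ are merged by `h`, so their masses add up at the image point $1$.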


C.6 Indefinite Integrals 

Let $(E, \mathcal{E}, \mu)$ be a measure space and $p : E \to \mathbb{R}_+$ measurable. Then

$$\nu(A) = \int_A p(x)\, d\mu(x), \qquad A \in \mathcal{E}$$

is a measure on $(E, \mathcal{E})$, called the indefinite integral of $p$ with respect to $\mu$. It can be shown that for any measurable positive function $f : E \to \mathbb{R}_+$ we have

$$\int_E f(x)\, d\nu(x) = \int_E f(x) p(x)\, d\mu(x).$$

This can be informally written as $p(x)\, d\mu(x) = d\nu(x)$.

C.7 Radon-Nikodym Theorem

Let $\mu$, $\nu$ be two measures on $(E, \mathcal{E})$ such that

$$\mu(A) = 0 \Rightarrow \nu(A) = 0, \qquad A \in \mathcal{E}.$$

Then $\nu$ is called absolutely continuous with respect to $\mu$. The following result states the existence of a density function $p$.





Theorem C.7.1 Let $\mu$ be σ-finite and $\nu$ be absolutely continuous with respect to $\mu$. Then there is a measurable function $p : E \to \mathbb{R}_+$ such that

$$\int_E f(x)\, d\nu(x) = \int_E f(x) p(x)\, d\mu(x),$$

for all measurable functions $f : E \to \mathbb{R}_+$.

The previous integral can be also written informally as $p(x)\, d\mu(x) = d\nu(x)$. The density function $p = \frac{d\nu}{d\mu}$ is called the Radon-Nikodym derivative.

Remark C.7.2 This deals with an informal understanding of the theorem. If the measures $\mu$, $\nu$ are considered as two systems of evaluation, the fact that $\nu$ is absolutely continuous with respect to $\mu$ means that the system of evaluation $\nu$ is less rigorous than the system $\mu$. This means that the mistakes that are negligible in the system $\mu$, i.e., $\mu(A) = 0$, also pass undetected by the system $\nu$, i.e., $\nu(A) = 0$. Under this hypothesis, the Radon-Nikodym theorem states that the more rigorous system can be scaled into the less rigorous system by the relation $d\nu = p\, d\mu$. The density function $p$ becomes a scaling function between the systems of evaluation.
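On a finite set the Radon-Nikodym derivative is just a ratio of point masses, which makes the scaling interpretation concrete. The following sketch assumes both measures are given as dictionaries of point masses; the function `density` and the values are hypothetical.

```python
def density(nu, mu):
    """p = d nu / d mu on a finite set; requires nu << mu."""
    for x, n in nu.items():
        if n > 0 and mu.get(x, 0) == 0:
            raise ValueError(f"nu is not absolutely continuous w.r.t. mu at {x!r}")
    # p is defined where mu has mass; elsewhere its value does not matter.
    return {x: nu.get(x, 0) / m for x, m in mu.items() if m > 0}

mu = {"a": 0.5, "b": 0.25, "c": 0.25}
nu = {"a": 0.125, "b": 0.5, "c": 0.0}
p = density(nu, mu)

f = lambda x: {"a": 1.0, "b": 2.0, "c": 3.0}[x]
lhs = sum(nu[x] * f(x) for x in nu)            # integral of f with respect to nu
rhs = sum(f(x) * p[x] * mu[x] for x in mu)     # integral of f p with respect to mu
print(p, lhs, rhs)
```

The two integrals coincide, in agreement with $d\nu = p\, d\mu$. If `nu` put mass on a point where `mu` has none, absolute continuity would fail and the function raises an error instead of returning a density.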


C.8 Egorov and Luzin’s Theorems 


Egorov's theorem establishes a relation between the almost everywhere convergence and uniform convergence.

Let $(E, \mathcal{E}, \mu)$ be a measure space and $f_n : E \to \mathbb{R} \cup \{\infty\}$ be a sequence of extended real value functions. We say that $f_n$ converges a.e. to a limit $f$ if $f_n(x) \to f(x)$, as $n \to \infty$, for all $x \in E \setminus N$, with $N$ a $\mu$-negligible set.

We say that the sequence $(f_n)$ converges uniformly to $f$ on $A$ if for any $\epsilon > 0$ there is $n_0 \ge 1$ such that

$$|f_n(x) - f(x)| < \epsilon, \qquad \forall n \ge n_0, \; \forall x \in A.$$

Theorem C.8.1 (Egorov) Let $E$ be a measurable set of finite measure, and $f_n$ a sequence of a.e. finite value measurable functions that converge a.e. on $E$ to a finite measurable function $f$. Then for any $\epsilon > 0$, there is a measurable subset $F$ of $E$ such that $\mu(F) < \epsilon$ and $(f_n)$ converges to $f$ uniformly on $E \setminus F$.

Loosely stated, any a.e. pointwise convergent sequence of measurable functions is uniformly convergent on nearly all its domain.

A consequence of this result is Luzin's Theorem, which states that a measurable function is nearly continuous on its domain, and hence a measurable function can be approximated by continuous functions.





A measure $\mu$ on $(\mathbb{R}^n, \mathcal{B}_{\mathbb{R}^n})$ is called regular if:

$$\mu(A) = \inf\{\mu(U); A \subset U, \; U \text{ open in } \mathbb{R}^n\} = \sup\{\mu(V); V \subset A, \; V \text{ open in } \mathbb{R}^n\}.$$

This means that the measure structure is related with the topological structure of the space.

Theorem C.8.2 If $f : I_n = [0, 1]^n \to \mathbb{R}$ is a measurable function, then for any $\epsilon > 0$ there is a compact set $K$ in $I_n$ such that $\mu(I_n \setminus K) < \epsilon$ and $f$ is continuous on $K$, where $\mu$ is a regular Borel measure.

C.9 Signed Measures 

The difference of two measures is not in general a measure, since it is not necessarily nonnegative. In the following we shall deal with this concept.

A signed measure on the measurable space $(E, \mathcal{E})$ is a map

$$\nu : \mathcal{E} \to \mathbb{R} \cup \{\pm\infty\}$$

such that

(i) $\nu$ assumes at most one of the values $-\infty$, $+\infty$;

(ii) $\nu(\emptyset) = 0$;

(iii) $\nu\big(\bigcup_{i \ge 1} A_i\big) = \sum_{i \ge 1} \nu(A_i)$, for any sequence $(A_i)$ of disjoint sets in $\mathcal{E}$ (i.e., $\nu$ is countably additive).

Example C.9.1 Any measure is a signed measure, while the reverse does not hold true in general.

Example C.9.2 Any difference of two measures, $\mu = \nu_1 - \nu_2$, is a signed measure. The reverse of this statement also holds true, as we shall see shortly.

Let $\nu$ be a signed measure. A set $G$ in $\mathcal{E}$ is called a positive set with respect to $\nu$ if any measurable subset of $G$ has a nonnegative measure, i.e.,

$$\nu(G \cap A) \ge 0, \qquad \forall A \in \mathcal{E}.$$

A set $F$ in $\mathcal{E}$ is called a negative set with respect to $\nu$ if any measurable subset of $F$ has a nonpositive measure, i.e.,

$$\nu(F \cap B) \le 0, \qquad \forall B \in \mathcal{E}.$$

A set which is simultaneously positive and negative with respect to a signed measure is called a null set. We note that any null set has measure zero, while the converse is in general false.





Proposition C.9.3 (Hahn Decomposition Theorem) Let $\nu$ be a signed measure on the measurable space $(E, \mathcal{E})$. Then there is a partition of $E$ into a positive set $A$ and a negative set $B$, i.e., $E = A \cup B$, $A \cap B = \emptyset$.

It is worth noting that the Hahn decomposition is not unique. However, for any two distinct decompositions $\{A_1, B_1\}$ and $\{A_2, B_2\}$ of $E$ it can be shown that

$$\nu(F \cap A_1) = \nu(F \cap A_2), \qquad \nu(F \cap B_1) = \nu(F \cap B_2), \qquad \forall F \in \mathcal{E}.$$

This suggests defining, for any Hahn decomposition $\{A, B\}$ of $E$, two measures, $\nu^+$, $\nu^-$, such that

$$\nu^+(F) = \nu(F \cap A), \qquad \nu^-(F) = -\nu(F \cap B), \qquad \forall F \in \mathcal{E}.$$

We note that $\nu = \nu^+ - \nu^-$. We also have $\nu^+(B) = \nu^-(A) = 0$, with $\{A, B\}$ a measurable partition of $E$. A pair of measures $\nu^+$, $\nu^-$ with this property is called mutually singular.

Proposition C.9.4 (Jordan Measure Decomposition) Let $\nu$ be a signed measure on the measurable space $(E, \mathcal{E})$. Then there are two measures $\nu^+$ and $\nu^-$ on $(E, \mathcal{E})$ such that $\nu = \nu^+ - \nu^-$. Furthermore, if $\nu^+$ and $\nu^-$ are mutually singular measures, the decomposition is unique.

The measures $\nu^+$ and $\nu^-$ are called the positive part and negative part of $\nu$, respectively. Since $\nu$ takes at most one of the values $\pm\infty$, one of the parts has to be a finite measure. The sum of measures $|\nu| = \nu^+ + \nu^-$ is a measure, called the absolute value of $\nu$.

The integration of a measurable function $f$ with respect to a signed measure $\nu$ is defined as

$$\int f \, d\nu = \int f \, d\nu^+ - \int f \, d\nu^-,$$

provided $f$ is integrable with respect to $|\nu|$. Furthermore, if $|f| \le C$, then

$$\Big| \int_E f \, d\nu \Big| \le C |\nu|(E).$$

J E 

Example C.9.1 Let $g$ be an integrable function on the measure space $(E, \mathcal{E}, \mu)$. The measure

$$\nu(F) = \int_F g(x)\, d\mu(x), \qquad \forall F \in \mathcal{E}$$

is a finite signed measure. The decomposition $g = g^+ - g^-$ yields $\nu = \nu^+ - \nu^-$, where

$$\nu^+(F) = \int_F g^+(x)\, d\mu(x), \qquad \nu^-(F) = \int_F g^-(x)\, d\mu(x).$$





Remark C.9.5 If $\mu$ is a signed finite measure on the measurable space
$(\Omega, \mathcal{F})$, define its total variation as

$$\|\mu\|_{TV} = \sup \sum_{i=1}^n |\mu(A_i)|$$

over all finite partitions with disjoint sets, $\Omega = \bigcup_{i=1}^n A_i$. The set of finite signed
measures with the norm $\|\cdot\|_{TV}$ forms a Banach space.



Appendix D 

Probability Theory 


The reader interested in probability can consult the book [23] for details.


D.1 General definitions


A probability space is a measure space $(\Omega, \mathcal{H}, P)$, where $P$ is a probability
measure, i.e., a measure with $P(\Omega) = 1$. The set $\Omega$ is the sample space,
sometimes regarded also as the states of the world; it represents the set of
outcomes of an experiment. The $\sigma$-algebra $\mathcal{H}$ is the history information; each set
$H \in \mathcal{H}$ is called an event. The probability measure $P$ evaluates the chance of
occurrence of events. For each $H \in \mathcal{H}$, the number $P(H)$ is the probability
that $H$ occurs.

A random variable is a mapping $X : \Omega \to \mathbb{R}$ which is $(\mathcal{H}, \mathcal{B}_{\mathbb{R}})$-measurable,
i.e., $X^{-1}(\mathcal{B}_{\mathbb{R}}) \subset \mathcal{H}$. This means that for each experiment outcome $\omega \in \Omega$, $X$
assigns a number $X(\omega)$.

The image of the probability measure $P$ through $X$, $\mu = P \circ X^{-1}$, is a
measure on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, called the distribution of the random variable $X$. More
precisely, we have

$$\mu(A) = P(X \in A) = P(\omega;\, X(\omega) \in A) = P(X^{-1}(A)).$$

The measure $\mu$ describes how $X$ is distributed. The function

$$F(x) = \mu(-\infty, x] = P(X \le x)$$

is called the distribution function of $X$.


© Springer Nature Switzerland AG 2020

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences,
https://doi.org/10.1007/978-3-030-36721-3





Let $(\Omega, \mathcal{H}, P)$ be a probability space and $(E, \mathcal{E})$, $(F, \mathcal{F})$ be two measurable
spaces. If $X : \Omega \to E$ and $f : E \to F$ are measurable, consider their
composition $Y = f \circ X$. If $\mu$ is the distribution of $X$, then $\nu = \mu \circ f^{-1}$ is the
distribution of $Y$. We can write

$$\nu(B) = P(Y \in B) = P(f \circ X \in B) = P(X \in f^{-1}(B)) = \mu(f^{-1}(B)) = (\mu \circ f^{-1})(B),$$

for every $B \in \mathcal{F}$.


D.2 Examples 


Bernoulli Distribution A random variable $X$ is said to be Bernoulli distributed
with parameter $p \in [0,1]$, if $X \in \{0,1\}$, with $P(X = 1) = p$ and
$P(X = 0) = 1 - p$. The mean and variance are given, respectively, by

$$\mathbb{E}[X] = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = p,$$

$$Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = 1^2 \cdot P(X = 1) + 0^2 \cdot P(X = 0) - p^2 = p - p^2 = p(1 - p).$$

We shall write $X \sim Bernoulli(p)$.
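The two Bernoulli formulas can be checked directly from the probability mass function. The sketch below uses exact rational arithmetic; the function name `bernoulli_moments` and the value p = 3/10 are assumptions made for this illustration.

```python
from fractions import Fraction


def bernoulli_moments(p):
    """Return (mean, variance) computed from the pmf of a Bernoulli(p) variable."""
    pmf = {1: p, 0: 1 - p}
    mean = sum(x * w for x, w in pmf.items())
    second_moment = sum(x ** 2 * w for x, w in pmf.items())
    return mean, second_moment - mean ** 2


p = Fraction(3, 10)
mean, var = bernoulli_moments(p)
print(mean, var, p * (1 - p))
```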

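Normal Distribution A random variable $X$ is called normally distributed
with parameters $\mu$ and $\sigma$ if $P(X \le x) = \int_{-\infty}^x f(u)\, du$, with

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}.$$

The mean and variance are given by

$$\mathbb{E}[X] = \mu, \quad Var(X) = \sigma^2.$$

We shall use the notation $X \sim \mathcal{N}(\mu, \sigma^2)$.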


D.3 Expectation 

The expectation of the random variable $X$ is the assessment of $X$ through the
probability measure $P$ as

$$\mathbb{E}[X] = P(X) = \int_\Omega X(\omega)\, dP(\omega).$$

If $\mu$ is the distribution measure of $X$, then the change of variable formula
provides

$$\mathbb{E}[f(X)] = \int_\Omega f(X(\omega))\, dP(\omega) = \int_{\mathbb{R}} f(y)\, dP(X^{-1}y) = \int_{\mathbb{R}} f(y)\, d\mu(y).$$

In particular, if $\mu$ is absolutely continuous with respect to the Lebesgue measure
$dy$ on $\mathbb{R}$, then there is a nonnegative measurable density function $p(y)$ such
that $d\mu(y) = p(y)\, dy$. Consequently, the previous formula becomes

$$\mathbb{E}[f(X)] = \int_{\mathbb{R}} f(y)\, p(y)\, dy.$$
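The density form of the expectation lends itself to a quick numerical check. The following sketch approximates the integral of f(y)p(y) for a normal density with a midpoint rule; the truncation to ten standard deviations, the grid size, and the function names are assumptions of this example, not part of the text.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def expectation(f, mu, sigma, n=20000, width=10.0):
    """Midpoint-rule approximation of the integral of f(y) p(y) dy.

    The integration range is truncated to [mu - width*sigma, mu + width*sigma].
    """
    a = mu - width * sigma
    h = 2 * width * sigma / n
    total = 0.0
    for k in range(n):
        y = a + (k + 0.5) * h
        total += f(y) * normal_pdf(y, mu, sigma)
    return total * h


mu, sigma = 1.0, 2.0
first = expectation(lambda y: y, mu, sigma)        # should be close to mu
second = expectation(lambda y: y * y, mu, sigma)   # should be close to mu^2 + sigma^2
print(first, second)
```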

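The expectation operator is a nonnegative, monotonic, and linear operator, i.e.,

(i) $X \ge 0 \Rightarrow \mathbb{E}[X] \ge 0$;

(ii) $X \ge Y \Rightarrow \mathbb{E}[X] \ge \mathbb{E}[Y]$;

(iii) $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ for $a, b \in \mathbb{R}$.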

D.4 Variance 

Let $\mu = \mathbb{E}[X]$ be the mean of $X$. The variance of the random variable $X$ is
defined as

$$Var(X) = \mathbb{E}[(X - \mu)^2].$$

The variance is a measure of deviation from the mean in the mean square
sense. If $p(x)$ denotes the probability density of $X$, then, physically, the variance
represents the moment of inertia of the curve $y = p(x)$ about the vertical
axis $x = \mu$. This is a measure of the ease of revolution of the graph $y = p(x)$
about a vertical axis passing through its mass center, $\mu$.

In general, the variance is neither additive nor multiplicative. However,
in some particular cases it is, as we shall see next. The covariance of two
random variables $X$ and $Y$ is defined by $Cov(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$.
The variance of the sum is given by

$$Var(X + Y) = Var(X) + 2\, Cov(X, Y) + Var(Y).$$

If $X$ and $Y$ are independent, then $Var(X + Y) = Var(X) + Var(Y)$. The variance
is also homogeneous of order 2, i.e., $Var(cX) = c^2\, Var(X)$, for all $c \in \mathbb{R}$.

We also have the following exact expression for the variance of a product
of independent random variables, see [49].

Lemma D.4.1 (Goodman's formula) If $X$ and $Y$ are two independent
random variables, then

$$Var(XY) = \mathbb{E}[X]^2\, Var(Y) + \mathbb{E}[Y]^2\, Var(X) + Var(X)\, Var(Y).$$

In particular, if $\mathbb{E}[X] = \mathbb{E}[Y] = 0$, then

$$Var(XY) = Var(X)\, Var(Y).$$

Proof: Denote the right side by

$$R = \mathbb{E}[X]^2\, Var(Y) + \mathbb{E}[Y]^2\, Var(X) + Var(X)\, Var(Y).$$

The definition of variance and some algebraic manipulations together with
the independence property provide

$$R = Var(Y)\big[\mathbb{E}[X]^2 + Var(X)\big] + \mathbb{E}[Y]^2\, Var(X)$$

$$= Var(Y)\, \mathbb{E}[X^2] + \mathbb{E}[Y]^2\, Var(X)$$

$$= \big[\mathbb{E}[Y^2] - \mathbb{E}[Y]^2\big]\, \mathbb{E}[X^2] + \mathbb{E}[Y]^2 \big[\mathbb{E}[X^2] - \mathbb{E}[X]^2\big]$$

$$= \mathbb{E}[Y^2]\, \mathbb{E}[X^2] - \mathbb{E}[Y]^2\, \mathbb{E}[X]^2 = \mathbb{E}[X^2 Y^2] - \mathbb{E}[XY]^2$$

$$= Var(XY),$$

which recovers the left side of the desired expression. ■
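Goodman's formula can be verified exactly on small discrete distributions. The sketch below builds the distribution of the product of two independent variables and compares both sides using rational arithmetic; the two probability mass functions and all helper names are hypothetical choices for this example.

```python
from fractions import Fraction
from itertools import product


def mean(pmf):
    """Expectation of a discrete variable given as {value: probability}."""
    return sum(x * w for x, w in pmf.items())


def variance(pmf):
    """Variance of a discrete variable given as {value: probability}."""
    m = mean(pmf)
    return sum((x - m) ** 2 * w for x, w in pmf.items())


def product_pmf(pmf_x, pmf_y):
    """Distribution of XY when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x * y] = out.get(x * y, 0) + px * py
    return out


def goodman_rhs(pmf_x, pmf_y):
    """Right side of Goodman's formula."""
    return (mean(pmf_x) ** 2 * variance(pmf_y)
            + mean(pmf_y) ** 2 * variance(pmf_x)
            + variance(pmf_x) * variance(pmf_y))


# Hypothetical independent variables X and Y.
pmf_x = {1: Fraction(1, 2), 3: Fraction(1, 2)}
pmf_y = {0: Fraction(1, 4), 2: Fraction(3, 4)}

lhs = variance(product_pmf(pmf_x, pmf_y))
rhs = goodman_rhs(pmf_x, pmf_y)
print(lhs, rhs)
```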

Variance approximation Let $m = \mathbb{E}[X]$ and $f$ be a differentiable function.
Linear approximation about $x = m$ provides

$$f(x) = f(m) + f'(m)(x - m) + O\big((x - m)^2\big).$$

Then replace the variable $x$ by the random variable $X$ to obtain

$$f(X) \approx f(m) + f'(m)(X - m).$$

Taking the variance on both sides and using its properties we arrive at the
following approximation formula:

$$Var(f(X)) \approx f'(m)^2\, Var(X), \qquad \text{(D.4.1)}$$

provided that $f$ is twice differentiable and that the mean $m$ and variance of
$X$ are finite. Therefore, a small variance of $X$ produces a small variance of
$f(X)$, provided $f'$ is bounded.
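To see how accurate the approximation (D.4.1) is, one can take a two-point variable X equal to m - h or m + h with probability 1/2 each, so that Var(X) = h^2 and the variance of f(X) is available in closed form. The choice of this toy distribution, of f = exp, and of the helper name are assumptions of this sketch.

```python
import math


def delta_method_check(f, fprime, m, h):
    """Compare Var(f(X)) with f'(m)^2 Var(X) for X = m - h or m + h, each w.p. 1/2.

    A variable taking two values a and b with equal probability has
    variance ((a - b) / 2)^2, so both sides are computed exactly.
    """
    exact = ((f(m + h) - f(m - h)) / 2.0) ** 2
    approx = fprime(m) ** 2 * h ** 2
    return exact, approx


for h in (0.5, 0.1, 0.01):
    exact, approx = delta_method_check(math.exp, math.exp, 1.0, h)
    print(h, exact, approx, exact / approx)
```

The printed ratio approaches 1 as h shrinks, which matches the statement that the approximation is good when the variance of X is small.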


D.5 Information generated by random variables 

Let $X : \Omega \to \mathbb{R}$ be a random variable. The information field generated by $X$
is the $\sigma$-algebra

$$\mathfrak{S}(X) = X^{-1}(\mathcal{B}_{\mathbb{R}}).$$

Let $X : \Omega \to \mathbb{R}$ be a random variable and $f : \mathbb{R} \to \mathbb{R}$ be a measurable function.
Consider the random variable $Y = f(X)$. Then $Y$ is $\mathfrak{S}(X)$-measurable,
i.e., $\mathfrak{S}(Y) \subset \mathfrak{S}(X)$. Equivalently stated, the information generated by $X$
determines the information generated by $Y$. The converse statement also holds
true. The proof of the following result can be found in Cinlar [23],
Proposition 4.4, p. 76.

Proposition D.5.1 Consider two random variables $X, Y : \Omega \to \mathbb{R}$.
Then $\mathfrak{S}(Y) \subset \mathfrak{S}(X)$ if and only if $Y$ is determined by $X$, i.e., there is a
measurable function $f$ such that $Y = f(X)$.

This can be stated also by saying that $Y$ is determined by $X$ if and only if the
information generated by $X$ is finer than the information generated by $Y$.






It is worth noting that the previous function $f$ is constructed by a limiting
procedure. The idea is the following: for a fixed $n$ we consider the
measurable set

$$A_{m,n} = Y^{-1}\Big(\Big[\frac{m}{2^n}, \frac{m+1}{2^n}\Big)\Big) \in \mathfrak{S}(Y) \subset \mathfrak{S}(X), \quad m = 0, \pm 1, \pm 2, \dots$$

so that $A_{m,n} = X^{-1}(B_{m,n})$ with $B_{m,n}$ measurable. Construct the simple
function $f_n(x) = \sum_m \frac{m}{2^n}\, 1_{B_{m,n}}(x)$. It can be shown that

$$f_n(X) \le Y < f_n(X) + \frac{1}{2^n}.$$

Then we choose $f = \lim_{n \to \infty} f_n$, which is measurable, as a limit of simple functions.


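Example D.5.1 Let $\omega = (\omega_1, \omega_2, \omega_3) \in \Omega$ and define the random variables
$X_i, Y_i : \Omega \to \mathbb{R}$, $i = 1, 2, 3$ by

$$X_1(\omega) = \omega_1, \quad X_2(\omega) = \omega_2, \quad X_3(\omega) = \omega_3,$$

$$Y_1(\omega) = \omega_1 - \omega_2, \quad Y_2(\omega) = \omega_1 + \omega_2, \quad Y_3(\omega) = \omega_1 + \omega_2 + \omega_3.$$

Then $\mathfrak{S}(X_1, X_2) = \mathfrak{S}(Y_1, Y_2)$ and $\mathfrak{S}(X_1, X_2, X_3) = \mathfrak{S}(Y_1, Y_2, Y_3)$, but
$\mathfrak{S}(X_2, X_3) \neq \mathfrak{S}(Y_2, Y_3)$, since $Y_3$ cannot be written in terms of $X_2$ and $X_3$.

A stochastic process is a family of random variables $(X_t)_{t \in T}$ indexed over
the continuous or discrete parameter $t$. The information generated by the
stochastic process $(X_t)_{t \in T}$ is the smallest $\sigma$-algebra with respect to which
each random variable $X_t$ is measurable. This can be written as

$$\mathcal{G} = \mathfrak{S}(X_t;\, t \in T) = \mathfrak{S}\Big(\bigcup_{t \in T} \mathfrak{S}(X_t)\Big) = \bigvee_{t \in T} \mathfrak{S}(X_t).$$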


D.5.1 Filtrations 

Let $(\Omega, \mathcal{H}, P)$ be a probability space. The $\sigma$-algebra $\mathcal{H}$ can be interpreted
as the entire history of the states of the world $\Omega$. The information available
until time $t$ is denoted by $\mathcal{F}_t$. We note that the information grows in time,
i.e., if $s < t$, then $\mathcal{F}_s \subset \mathcal{F}_t$. An increasing flow of information on $\mathcal{H}$, $(\mathcal{F}_t)_{t \in T}$,
is called a filtration.

Each stochastic process, $(X_t)_{t \in T}$, defines a natural filtration

$$\mathcal{F}_t = \mathfrak{S}(X_s;\, s \le t),$$

which is the history of the process up to each time instance $t$. In this case each
random variable $X_t$ is $\mathcal{F}_t$-measurable. Stochastic processes with this property
are called adapted to the filtration.






Figure 1: In a feedforward neural network the information flow satisfies the
sequence of inclusions $\mathcal{F}_Z \subset \mathcal{F}_U \subset \mathcal{F}_Y \subset \mathcal{F}_X$, which forms a filtration.


Example D.5.2 Consider the feedforward neural network given by Fig. 1,
which has two hidden layers, $Y = (Y_1, Y_2, Y_3)^T$, $U = (U_1, U_2)^T$. Its input is
given by the random variable $X = (X_1, X_2)^T$ and the output is given by $Z$.
The input information is the information generated by $X$, given by the
sigma-algebra $\mathcal{F}_X = \mathfrak{S}(X_1, X_2)$. The information in the first and second hidden
layers is given by $\mathcal{F}_Y = \mathfrak{S}(Y_1, Y_2, Y_3)$ and $\mathcal{F}_U = \mathfrak{S}(U_1, U_2)$, respectively. The
output information is $\mathcal{F}_Z = \mathfrak{S}(Z)$. Since $Y_j$ are determined by $X_i$, it follows
that $Y_j$ are $\mathcal{F}_X$-measurable, a fact that can be written also as $\mathcal{F}_Y \subset \mathcal{F}_X$.
Similarly, we have the following sequence of inclusions:

$$\mathcal{F}_Z \subset \mathcal{F}_U \subset \mathcal{F}_Y \subset \mathcal{F}_X.$$

This natural filtration describes the information flow through the network.


D.5.2 Conditional expectations 


Let $X$ be a random variable on the probability space $(\Omega, \mathcal{H}, P)$ and consider
some partial information $\mathcal{F} \subset \mathcal{H}$. The random variable determined by $\mathcal{F}$,
which is "the best approximator" of $X$, is called the conditional expectation
of $X$ given $\mathcal{F}$. This is defined as the variable $\widehat{X}$ that satisfies the properties:

(i) $\widehat{X}$ is $\mathcal{F}$-measurable;

(ii) $\int_A \widehat{X}\, dP = \int_A X\, dP$, $\forall A \in \mathcal{F}$.

It can be shown that in the case of a square integrable random variable $X$,
we have

$$\|X - \widehat{X}\|^2 \le \|X - Z\|^2, \quad \forall Z \in S_{\mathcal{F}},$$

where $S_{\mathcal{F}} = \{f \in L^2(\Omega);\ f \text{ is } \mathcal{F}\text{-measurable}\}$. This means that $\widehat{X}$ is the
orthogonal projection of $X$ onto $S_{\mathcal{F}}$, i.e., it is the best approximator of $X$ in
the mean square sense with elements from $S_{\mathcal{F}}$. In the previous relation the
norm is the one induced by expectation of the square, $\|X\|^2 = \mathbb{E}[X^2]$.
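On a finite sample space, where the partial information is generated by a partition, the conditional expectation reduces to averaging over each block of the partition. The sketch below illustrates the averaging property and the best-approximation property under that assumption; the sample space, the partition, and the helper names are hypothetical.

```python
from fractions import Fraction


def conditional_expectation(X, P, partition):
    """Average X over each block of the partition, weighted by P."""
    X_hat = {}
    for block in partition:
        mass = sum(P[w] for w in block)
        average = sum(X[w] * P[w] for w in block) / mass
        for w in block:
            X_hat[w] = average
    return X_hat


def mean_square_error(X, Z, P):
    """E[(X - Z)^2] under the probability P."""
    return sum((X[w] - Z[w]) ** 2 * P[w] for w in P)


# Hypothetical setting: six equally likely outcomes, X(w) = w^2,
# and information that only reveals which pair {0,1}, {2,3}, {4,5} occurred.
P = {w: Fraction(1, 6) for w in range(6)}
X = {w: Fraction(w * w) for w in range(6)}
partition = [(0, 1), (2, 3), (4, 5)]

X_hat = conditional_expectation(X, P, partition)

# Another variable that is constant on the blocks, used for comparison.
Z = {w: X_hat[w] + 1 for w in P}

print(X_hat)
print(mean_square_error(X, X_hat, P), mean_square_error(X, Z, P))
```

Any other candidate that is constant on the blocks, such as `Z` above, gives a larger mean square error, in line with the projection interpretation.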

















D.6 Types of Convergence 

In the following $(\Omega, \mathcal{H}, P)$ will denote a probability space and $X_n : \Omega \to \mathbb{R}$
will be a sequence of random variables.

D.6.1 Convergence in probability

The sequence $X_n$ converges in probability to $X$ if $\lim_{n \to \infty} P(|X_n - X| < \epsilon) = 1$
for every $\epsilon > 0$. This type of convergence has the following interpretation. If $X$
denotes the center of a target of radius $\epsilon$ and $X_n$ the location of the $n$-th shot, then
$\{|X_n - X| < \epsilon\}$ represents the event that the $n$-th shot hits the target.
Convergence in probability means that in the long run the chance that the
shots $X_n$ will hit any target centered at $X$ with arbitrarily fixed radius $\epsilon$
approaches 1.

D.6.2 Almost sure convergence 

$X_n$ converges almost surely to $X$ if $P(\omega;\, X_n(\omega) \to X(\omega)) = 1$. This
means that for almost any state $\omega \in \Omega$, the sequence $X_n(\omega)$ converges to
$X(\omega)$ as a sequence of real numbers, as $n \to \infty$.

The following results provide sufficient conditions for a.s. convergence.

Proposition D.6.1 (Borel-Cantelli Lemma I) Assume

$$\sum_{n \ge 1} P(|X_n - X| > \epsilon) < \infty, \quad \forall \epsilon > 0.$$

Then $X_n \to X$ almost surely.

Proposition D.6.2 (Borel-Cantelli Lemma II) Suppose there is a sequence
$(\epsilon_n)$ decreasing to 0 such that

$$\sum_{n \ge 1} P(|X_n - X| > \epsilon_n) < \infty.$$

Then $X_n \to X$ almost surely.

It is worth noting that almost sure convergence implies convergence
in probability.

D.6.3 $L^p$-convergence

The sequence $X_n$ converges to $X$ in $L^p$-sense if $X_n, X \in L^p(\Omega)$ and

$$\mathbb{E}(|X_n - X|^p) \to 0, \quad n \to \infty.$$

It is worth noting that Markov's inequality

$$P(|X_n - X| > \epsilon) \le \frac{\mathbb{E}(|X_n - X|^p)}{\epsilon^p}$$

shows that $L^p$-convergence implies convergence in probability.

If $p = 2$, the convergence in $L^2$ is also called the mean square convergence.

A classical result involving all previous types of convergence is the Law
of Large Numbers:

Theorem D.6.3 Consider $X_1, X_2, X_3, \dots$ independent, identically distributed
random variables, with mean $\mathbb{E}[X_j] = a$ and variance $Var[X_j] = b$, both
finite. Let $\bar{X}_n = \frac{1}{n}(X_1 + \dots + X_n)$. Then $\bar{X}_n$ converges to $a$ in mean square,
almost surely and in probability.
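A small simulation gives a feel for the Law of Large Numbers. The sketch below averages independent Bernoulli(0.3) samples for increasing n; the distribution, the seed, and the function name `sample_mean` are assumptions made for this illustration, and a simulation of course only suggests the convergence rather than proving it.

```python
import random


def sample_mean(n, p, rng):
    """Average of n independent Bernoulli(p) samples drawn from rng."""
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so that the run is reproducible
for n in (10, 1000, 100000):
    print(n, sample_mean(n, 0.3, rng))
```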


D.6.4 Weak convergence 

Consider the random variables $X_n$, $X$, with distribution measures $\mu_n$, $\mu$,
respectively, and denote

$$C_b = \{f : \mathbb{R} \to \mathbb{R};\ f \text{ continuous and bounded}\}.$$

The sequence $X_n$ converges to $X$ in distribution if

$$\mathbb{E}[f \circ X_n] \to \mathbb{E}[f \circ X], \quad n \to \infty, \quad \forall f \in C_b.$$

The measures $\mu_n$ converge weakly to $\mu$ if

$$\mu_n f \to \mu f, \quad n \to \infty, \quad \forall f \in C_b.$$

The relation

$$\mu(f) = \int_{\mathbb{R}} f(x)\, d\mu(x) = \int_\Omega f \circ X\, dP = \mathbb{E}[f \circ X]$$

shows that the convergence in distribution for random variables corresponds
to the weak convergence of the associated distribution measures.

Another equivalent formulation is the following. If $\varphi_X(t) = \mathbb{E}[e^{itX}]$ denotes
the characteristic function of the random variable $X$, then $X_n$ converges to
$X$ in distribution if and only if $\varphi_{X_n}(t) \to \varphi_X(t)$ for all $t \in \mathbb{R}$. If $\mu$ denotes
the distribution measure of $X$, then $\varphi_X(t) = \int_{\mathbb{R}} e^{itx}\, d\mu(x)$. This is called the
Fourier transform of $\mu$. Furthermore, if $\mu$ is absolutely continuous with respect
to $dx$, then $X$ has a probability density $p(x)$, i.e., $d\mu(x) = p(x)\, dx$ and then
$\varphi_X(t) = \int_{\mathbb{R}} e^{itx} p(x)\, dx$. This is called the Fourier transform of $p(x)$.





A useful property is the injectivity of this transform, i.e., if $\varphi_X = 0$ then
$p = 0$. A heuristic explanation is based on the concept of frequency. More
precisely, the Fourier transform of a time-domain signal $p(x)$ (i.e., $p(x)$ is the
amplitude of a signal at time $x$) is a signal in the frequency domain, given
by $\varphi_X(t)$ (i.e., $\varphi_X(t)$ is the amplitude of the signal at frequency $t$). Now, if
the Fourier transform vanishes, i.e., $\varphi_X(t) = 0$ for each $t$, this implies the
absence of a signal regardless of its frequency. This must be the zero signal,
$p = 0$.

It is worth noting that all previously mentioned types of convergence
imply the convergence in distribution. The following classical result uses this
type of convergence.


Theorem D.6.4 (Central Limit Theorem) Let $X_1, X_2, X_3, \dots$ be independent,
identically distributed random variables, with mean $\mathbb{E}[X_j] = a$ and
variance $Var(X_j) = b$, both finite. Denote $Z_n = \frac{S_n - na}{\sqrt{nb}}$, with $S_n = X_1 + \dots + X_n$.
Then $Z_n$ tends in distribution to a standard normal variable, $\xi \sim \mathcal{N}(0,1)$.

This can be also stated as the convergence of the distribution function

$$\lim_{n \to \infty} P(Z_n \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-\frac{u^2}{2}}\, du.$$

D.7 Log-Likelihood Function 

This section provides some functional equations for the logarithmic, exponential,
and linear functions.

The existence of a function which transforms products of numbers into
sums is of utmost importance for being able to define the concept of information.

Let $s$ be an event with probability $P(s)$. The information conveyed by $s$ is
large when the probability $P(s)$ is small, i.e., when the event is a surprise.
Then the information contained in $s$ is given by a function of its probability,
$g(P(s))$, where $g$ is a positive decreasing function, with $g(0+) = +\infty$
and $g(1-) = 0$. This means that events with zero probability have infinite
information, while events which happen with probability 1 contain practically
no information. Furthermore, if $s_1$ and $s_2$ are two independent events
with probabilities of realization $P(s_1) = \pi_1$ and $P(s_2) = \pi_2$, respectively, by
heuristic reasoning, the information produced by both events has to be the
sum of the individual information produced by each event, i.e.,

$$g(P(s_1 \cap s_2)) = g(\pi_1 \pi_2) = g(\pi_1) + g(\pi_2).$$

The next proposition shows that the only function satisfying these properties
is $g(x) = -\ln x$, where the negative sign was included for positivity.
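The additivity of information for independent events is easy to check numerically. The following sketch implements g(p) = -ln p and evaluates it on two hypothetical probabilities; the function name `information` and the sample values are assumptions of this example.

```python
import math


def information(p):
    """Information g(p) = -ln p of an event with probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p)


pi_1, pi_2 = 0.5, 0.25  # hypothetical probabilities of two independent events
joint = information(pi_1 * pi_2)
separate = information(pi_1) + information(pi_2)
print(joint, separate)
```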





Proposition D.7.1 Any differentiable function $f : (0, \infty) \to \mathbb{R}$ satisfying

$$f(xy) = f(x) + f(y), \quad \forall x, y \in (0, \infty)$$

is of the form $f(x) = c \ln x$, with $c$ real constant.


Proof: Let $y = 1 + \epsilon$, with $\epsilon > 0$ small. Then

$$f(x + x\epsilon) = f(x) + f(1 + \epsilon).$$

Using the continuity of $f$, taking the limit

$$\lim_{\epsilon \to 0} f(x + x\epsilon) = f(x) + \lim_{\epsilon \to 0} f(1 + \epsilon)$$

yields $f(1) = 0$. This implies

$$\lim_{\epsilon \to 0} \frac{f(1 + \epsilon)}{\epsilon} = \lim_{\epsilon \to 0} \frac{f(1 + \epsilon) - f(1)}{\epsilon} = f'(1).$$

The relation $f(x + x\epsilon) = f(x) + f(1 + \epsilon)$ can be written equivalently as

$$\frac{f(x + x\epsilon) - f(x)}{x\epsilon} = \frac{f(1 + \epsilon)}{x\epsilon}. \qquad \text{(D.7.2)}$$

Taking the limit with $\epsilon \to 0$ transforms the above relation into a differential
equation

$$f'(x) = \frac{1}{x} f'(1).$$

Let $c = f'(1)$. Integrating in $f'(x) = \frac{c}{x}$ yields the solution $f(x) = c \ln x + K$,
with $K$ constant. Substituting in the initial functional equation we obtain
$K = 0$. ■


Proposition D.7.2 Any differentiable function $f : \mathbb{R} \to \mathbb{R}$ satisfying

$$f(x + y) = f(x) + f(y), \quad \forall x, y \in \mathbb{R}$$

is of the form $f(x) = cx$, with $c$ real constant.


Proof: If we let $x = y = 0$, the equation becomes $f(0) = 2f(0)$, and hence
$f(0) = 0$. If we take $y = \epsilon$, then $f(x + \epsilon) - f(x) = f(\epsilon)$. Dividing by $\epsilon$ and taking
the limit, we have

$$\lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon} = \lim_{\epsilon \to 0} \frac{f(\epsilon) - f(0)}{\epsilon},$$

which is written as $f'(x) = f'(0)$. Integrating yields $f(x) = cx + b$, with
$c = f'(0)$ and $b$ real constant. Substituting in the initial equation provides
$b = 0$. Hence, smooth additive functions are linear. ■










Proposition D.7.3 Any differentiable function $f : \mathbb{R} \to (0, \infty)$ satisfying

$$f(x + y) = f(x) f(y), \quad \forall x, y \in \mathbb{R}$$

is of the form $f(x) = e^{cx}$, with $c$ real constant.

Proof: Applying the logarithm function to the given equation, we obtain
$g(x + y) = g(x) + g(y)$, with $g(x) = \ln(f(x))$. Then Proposition D.7.2 yields
$g(x) = cx$, with $c \in \mathbb{R}$. Therefore, $f(x) = e^{g(x)} = e^{cx}$. ■

Remark D.7.4 It can be proved that Propositions D.7.2 and D.7.3 hold under
the weaker hypothesis that $f$ is just continuous.


D.8 Brownian Motion 


Section 4.13.2 uses the notion of Brownian motion. For a gentle introduction
to this subject the reader can consult [20]. For more specialized topics, the
reader can consult [37].

Definition D.8.1 A Brownian motion is a stochastic process $W_t$, $t \ge 0$,
which satisfies:

(i) $W_0 = 0$ (the process starts at the origin);

(ii) if $0 \le u < t < s$, then $W_s - W_t$ and $W_t - W_u$ are independent (the process
has independent increments);

(iii) $t \to W_t$ is continuous;

(iv) the increments are normally distributed, with $W_t - W_s \sim \mathcal{N}(0, |t - s|)$.

Note also that $\mathbb{E}[W_t] = 0$, $\mathbb{E}[W_t^2] = t$ and $Cov(W_t, W_s) = \min\{s, t\}$.
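A discretized Brownian path can be simulated by summing independent normal increments with variance equal to the time step. The sketch below does this on a uniform grid and estimates the second moment at the final time by Monte Carlo; the grid size, the number of trials, the seed, and the function names are assumptions of this illustration.

```python
import math
import random


def brownian_path(t_max, steps, rng):
    """Sample W at times 0, dt, 2dt, ..., t_max using N(0, dt) increments."""
    dt = t_max / steps
    path = [0.0]  # property (i): the process starts at the origin
    for _ in range(steps):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


def second_moment_at_end(t_max, steps, trials, rng):
    """Monte Carlo estimate of E[W_t^2] at t = t_max."""
    total = 0.0
    for _ in range(trials):
        total += brownian_path(t_max, steps, rng)[-1] ** 2
    return total / trials


rng = random.Random(0)
print(second_moment_at_end(2.0, 20, 2000, rng))  # expected to be near t = 2
```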

Ito's formula If $X_t$ is a stochastic process satisfying

$$dX_t = b(X_t)\, dt + \sigma(X_t)\, dW_t,$$

with $b$ and $\sigma$ measurable functions, and $F_t = f(X_t)$, with $f$ twice differentiable,
then

$$dF_t = \Big[b(X_t) f'(X_t) + \frac{1}{2} \sigma(X_t)^2 f''(X_t)\Big] dt + \sigma(X_t) f'(X_t)\, dW_t.$$

Some sort of converse is given by Dynkin's formula:

Consider the Ito diffusion

$$dX_t = b(X_t)\, dt + \sigma(X_t)\, dW_t, \quad X_0 = x.$$

Then for any $f \in C_0^2(\mathbb{R}^n)$ we have

$$\mathbb{E}^x[f(X_t)] = f(x) + \mathbb{E}^x\Big[\int_0^t A f(X_s)\, ds\Big],$$

where the conditional expectation is

$$\mathbb{E}^x[f(X_t)] = \mathbb{E}[f(X_t) \mid X_0 = x]$$

and $A$ is the infinitesimal generator of $X_t$. This means

$$A f(x) = \lim_{t \searrow 0} \frac{\mathbb{E}^x[f(X_t)] - f(x)}{t}.$$


Appendix E 

Functional Analysis 


This section presents the bare bones of the functional analysis results needed 
for the purposes of this book. The reader interested in more details is referred 
to Rudin’s book [106]. 


E.1 Banach spaces


This section deals with a mathematical object endowed with both topological
and algebraic structure. Let $(\mathcal{X}, +, \cdot)$ be a linear vector space, where "$+$"
denotes the addition of elements of $\mathcal{X}$ and "$\cdot$" is the multiplication with real
scalars.

A norm on $\mathcal{X}$ is a real-valued function $\|\cdot\| : \mathcal{X} \to \mathbb{R}$ satisfying the following
properties:

(i) $\|x\| \ge 0$, with $\|x\| = 0 \iff x = 0$;

(ii) $\|\alpha x\| = |\alpha|\, \|x\|$, $\forall \alpha \in \mathbb{R}$;

(iii) $\|x + y\| \le \|x\| + \|y\|$, $\forall x, y \in \mathcal{X}$.

The pair $(\mathcal{X}, \|\cdot\|)$ is called a normed space. The norm induces the metric
$d(x, y) = \|x - y\|$, and thus $(\mathcal{X}, d)$ becomes a metric space. A Banach space
is a normed vector space which is complete in this metric.

It is worth noting the sequential continuity of the norm: if $x_n \to x$, then
$\|x_n\| \to \|x\|$ for $n \to \infty$.

A few examples of Banach spaces are given in the following:

1. The $n$-dimensional real vector space $\mathbb{R}^n$ together with the Euclidean norm,
$\|x\| = (x_1^2 + \dots + x_n^2)^{1/2}$, forms a Banach space.




2. Let $K \subset \mathbb{R}^n$ be a compact set. The space $C(K)$ of all continuous real-valued
functions on $K$ with the norm $\|f\| = \max_{x \in K} |f(x)|$ forms a Banach space.

3. Let $p \ge 1$ and consider $L^p[0,1] = \{f;\ \int_0^1 |f|^p < \infty\}$, where $f$ represents a
class of all measurable functions which are equal a.e. This is a vector space
with respect to addition of functions and multiplication by real numbers. The
norm is given by $\|f\| = \|f\|_p = \big(\int_0^1 |f|^p\big)^{1/p}$. The fact that $\|f\| = 0$ implies
$f = 0$ a.e. is consistent with the definition of the space as a space of classes of
functions equal a.e. $L^p[0,1]$ is a Banach space by the Riesz-Fischer theorem.

4. Consider $L^\infty[0,1]$ to be the space of a.e. bounded measurable functions on
$[0,1]$. This is a vector space with respect to addition of functions and
multiplication by real numbers. The norm is given by $\|f\| = \|f\|_\infty = \inf_{g = f \text{ a.e.}} \sup |g|$.


E.2 Linear Operators

The reader interested in this topic is referred to the comprehensive book [36].
Let $\mathcal{X}$ and $\mathcal{Y}$ be two vector spaces. A mapping $T : \mathcal{X} \to \mathcal{Y}$ is called a linear
operator if

$$T(a_1 x_1 + a_2 x_2) = a_1 T(x_1) + a_2 T(x_2), \quad \forall a_i \in \mathbb{R},\ \forall x_i \in \mathcal{X}.$$

Assume now that $\mathcal{X}$ and $\mathcal{Y}$ are normed vector spaces. The linear operator $T$
is called bounded if there is a constant $M > 0$ such that

$$\|Tx\| \le M \|x\|, \quad \forall x \in \mathcal{X}.$$

The smallest such $M$ is called the norm of the operator, and it is denoted by
$\|T\|$. We also have the equivalent definitions

$$\|T\| = \sup_{x \ne 0} \frac{\|Tx\|}{\|x\|} = \sup_{\|x\| = 1} \|Tx\| = \sup_{\|x\| \le 1} \|Tx\|.$$

Since

$$\|Tx_1 - Tx_2\| \le M \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathcal{X},$$

it follows that a bounded linear operator is uniformly continuous, and hence
continuous. Conversely, if the linear operator $T$ is continuous at only one
point, then it is bounded.

It can be shown that the space of bounded linear operators $T : \mathcal{X} \to \mathcal{Y}$,
with $\mathcal{Y}$ Banach space, is also a Banach space.

If the space $\mathcal{Y} = \mathbb{R}$, the linear operator $T$ is called a linear functional. In
particular, the space of bounded linear functionals forms a Banach space.
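For a linear operator on the Euclidean plane given by a 2 x 2 matrix, the operator norm defined in this section can be approximated by maximizing the norm of Tx over sampled unit vectors, and compared with the known closed form, the square root of the largest eigenvalue of the matrix A^T A. The sketch below assumes the Euclidean norm on both spaces; the matrix and the helper names are hypothetical.

```python
import math


def apply(A, x):
    """Matrix-vector product for a 2 x 2 matrix A and a vector x in the plane."""
    return (A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1])


def norm(x):
    """Euclidean norm in the plane."""
    return math.hypot(x[0], x[1])


def operator_norm_sampled(A, samples=20000):
    """Approximate sup of ||Ax|| over ||x|| = 1 by sampling the unit circle."""
    best = 0.0
    for k in range(samples):
        theta = 2.0 * math.pi * k / samples
        best = max(best, norm(apply(A, (math.cos(theta), math.sin(theta)))))
    return best


def operator_norm_exact(A):
    """Largest singular value: square root of the top eigenvalue of A^T A."""
    a = A[0][0] ** 2 + A[1][0] ** 2
    b = A[0][0] * A[0][1] + A[1][0] * A[1][1]
    d = A[0][1] ** 2 + A[1][1] ** 2
    top_eigenvalue = (a + d) / 2.0 + math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return math.sqrt(top_eigenvalue)


A = [[3.0, 0.0], [4.0, 5.0]]  # hypothetical operator
print(operator_norm_sampled(A), operator_norm_exact(A))
```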












E.3 Hahn-Banach Theorem 


A convex functional on the vector space $\mathcal{X}$ is a function $p : \mathcal{X} \to \mathbb{R}$ such that

(i) $p(x + y) \le p(x) + p(y)$, i.e., $p$ is subadditive;

(ii) $p(\alpha x) = \alpha p(x)$ for each $\alpha \ge 0$, i.e., $p$ is positive homogeneous.


Example E.3.1 Let $\mathcal{X} = \mathbb{R}^n$ and consider $p(x) = \max_{1 \le i \le n} |x_i|$, where
$x = (x_1, \dots, x_n)$. Then $p(x)$ is a convex functional on $\mathbb{R}^n$.


If $(\mathcal{X}, +, \cdot)$ is a linear space, a subset $\mathcal{X}_0 \subset \mathcal{X}$ is a linear subspace if $\mathcal{X}_0$
is closed with respect to the endowed operations from $\mathcal{X}$. Consequently, $\mathcal{X}_0$
becomes a linear space with respect to these operations.

The following result deals with the extension of linear functionals from a
subspace to the whole space, such that certain properties are preserved.


Theorem E.3.1 (Hahn-Banach) Let $\mathcal{X}$ be a linear real vector space, $\mathcal{X}_0$ a
linear subspace, $p$ a convex functional on $\mathcal{X}$, and $f : \mathcal{X}_0 \to \mathbb{R}$ a linear
functional such that $f(x) \le p(x)$ for all $x \in \mathcal{X}_0$.

Then there is a linear functional $F : \mathcal{X} \to \mathbb{R}$ such that

(i) $F|_{\mathcal{X}_0} = f$ (the restriction of $F$ to $\mathcal{X}_0$ is $f$).

(ii) $F(x) \le p(x)$ for all $x \in \mathcal{X}$.


We include next a few applications of the Hahn-Banach theorem.

1. Let $p : \mathcal{X} \to [0, +\infty)$ be a nonnegative, convex functional and $x_0 \in \mathcal{X}$ be
a fixed element. Then there is a linear functional $F$ on $\mathcal{X}$ such that $F(x_0) = p(x_0)$
and $F(x) \le p(x)$ for all $x$ in $\mathcal{X}$.

2. Let $x_0$ be an element in the normed space $\mathcal{X}$. Then there is a bounded
linear functional $F$ on $\mathcal{X}$ such that $F(x_0) = \|F\|\, \|x_0\|$.

3. Let $S$ be a linear subspace of the normed linear space $\mathcal{X}$ and $y$ an element
of $\mathcal{X}$ whose distance to $S$ is at least $\delta$, i.e.,

$$\|y - s\| \ge \delta, \quad \forall s \in S.$$

Then there is a bounded linear functional $f$ on $\mathcal{X}$ with $\|f\| \le 1$, $f(y) = \delta$,
and such that $f(s) = 0$ for all $s \in S$.


E.4 Hilbert Spaces 

A Hilbert space $H$ is a Banach space endowed with a function $(\,,\,) : H \times H \to \mathbb{R}$
satisfying the following:

(i) $(x, x) = \|x\|^2$;

(ii) $(x, y) = (y, x)$;

(iii) $(c_1 x_1 + c_2 x_2, y) = c_1 (x_1, y) + c_2 (x_2, y)$, $c_i \in \mathbb{R}$, $x_i, y \in H$.

Example E.4.1 $H = \mathbb{R}^n$, with $(x, y) = \sum_{i=1}^n x_i y_i$.

Example E.4.2 $H = L^2[0,1]$, with $(x, y) = \int_0^1 x(t) y(t)\, dt$.

Cauchy's inequality states that $|(x, y)| \le \|x\|\, \|y\|$, so the linear functional
$g(x) = (x, y)$ is bounded, and hence continuous from $H$ to $\mathbb{R}$. Consequently,
if $x_n \to x$ in $H$, then $(x_n, y) \to (x, y)$, as $n \to \infty$.

Two elements $x, y \in H$ are called orthogonal if $(x, y) = 0$. A set $U$ is
called an orthogonal system if any two distinct elements of $U$ are orthogonal.

Example E.4.3 The set $\{1, \cos t, \sin t, \dots, \cos nt, \sin nt, \dots\}$ is an orthogonal
system for $H = L^2[-\pi, \pi]$.

The orthogonal system $U$ is called orthonormal if $\|x\| = 1$ for all $x \in U$.

Example E.4.4 The following set

$$\Big\{\frac{1}{\sqrt{2\pi}}, \frac{1}{\sqrt{\pi}} \cos t, \frac{1}{\sqrt{\pi}} \sin t, \dots, \frac{1}{\sqrt{\pi}} \cos nt, \frac{1}{\sqrt{\pi}} \sin nt, \dots\Big\}$$

is an orthonormal system for $H = L^2[-\pi, \pi]$.

Let $\{x_1, x_2, \dots\}$ be a countable orthonormal system in $H$. The Fourier
coefficient of an element $x \in H$ with respect to the previous system is given
by $c_k = (x, x_k)$. Then Bessel's inequality states that

$$\sum_{k \ge 1} c_k^2 \le \|x\|^2.$$
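Bessel's inequality can be checked numerically for the trigonometric orthonormal system on the interval from -pi to pi. The sketch below approximates the inner product with a midpoint rule and sums the squared Fourier coefficients of x(t) = t over the first few elements of the system; the quadrature, the truncation level, and the helper names are assumptions of this example.

```python
import math


def inner(x, y, n=4000):
    """Midpoint-rule approximation of the L^2[-pi, pi] inner product of x and y."""
    a = -math.pi
    h = 2.0 * math.pi / n
    return h * sum(x(a + (k + 0.5) * h) * y(a + (k + 0.5) * h) for k in range(n))


def orthonormal_system(kmax):
    """The constant function plus normalized cos(kt), sin(kt) for k = 1..kmax."""
    system = [lambda t: 1.0 / math.sqrt(2.0 * math.pi)]
    for k in range(1, kmax + 1):
        system.append(lambda t, k=k: math.cos(k * t) / math.sqrt(math.pi))
        system.append(lambda t, k=k: math.sin(k * t) / math.sqrt(math.pi))
    return system


def bessel_sum(x, kmax):
    """Sum of squared Fourier coefficients over the truncated system."""
    return sum(inner(x, e) ** 2 for e in orthonormal_system(kmax))


def identity(t):
    return t


partial = bessel_sum(identity, 10)
norm_squared = inner(identity, identity)
print(partial, norm_squared)
```

For this particular function the squared coefficients are 4*pi/k^2 for the sine terms and zero otherwise, so the partial sums increase toward the squared norm 2*pi^3/3 without exceeding it.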


A linear subspace $\mathcal{X}_0$ of a Hilbert space is called closed if it contains the
limit of any convergent sequence $(x_n)$ in $\mathcal{X}_0$. In this case, for any element $x$,
the number $d(x, \mathcal{X}_0) = \inf\{\|x - y\|;\ y \in \mathcal{X}_0\}$ is called the distance from $x$ to
the subspace $\mathcal{X}_0$.


Theorem E.4.1 Let $\mathcal{X}_0$ be a closed linear subspace of the Hilbert space $\mathcal{X}$.
Then for any element $x$ in $\mathcal{X}$ there is an element $x_0 \in \mathcal{X}_0$ such that $\|x - x_0\|$
equals the distance from $x$ to $\mathcal{X}_0$.


The element $x_0$ is called the projection of $x$ onto the subspace $\mathcal{X}_0$.


E.5 Representation Theorems

The present section provides representations of linear functionals on different
spaces, which are needed for showing the universal approximator property of
neural networks. Most of them are usually quoted in the literature as "Riesz's
theorems of representation." The reader can find the proofs, for instance, in
the books of Halmos [51] or Royden [104].

The next result is a representation of bounded linear functionals on
Hilbert spaces.


Theorem E.5.1 (Riesz) Let $f : H \to \mathbb{R}$ be a bounded linear functional on
the Hilbert space $H$ endowed with the inner product $(\,,\,)$. Then there is a
unique element $y \in H$ such that $f(x) = (x, y)$ for all $x \in H$. Furthermore,
$\|f\| = \|y\|$.


In particular, if $F : L^2[0,1] \to \mathbb{R}$ is a bounded linear functional, there is a
unique $g \in L^2[0,1]$ such that $F(f) = \int_0^1 f(t) g(t)\, dt$.

Even if $L^p$ is not a Hilbert space for $p \ne 2$ (it is a complete space though)
this result can still be extended to a representation result on the spaces $L^p$.


Theorem E.5.2 (Riesz) Let $F$ be a bounded linear functional on $L^p[0,1]$,
with $1 < p < \infty$. Then there is a unique function $g \in L^q[0,1]$, with $\frac{1}{p} + \frac{1}{q} = 1$,
such that $F(f) = \int_0^1 f(t) g(t)\, dt$. We also have $\|F\| = \|g\|_q$.


The previous result holds also for the case $p = 1$ with some small modifications.
Let $L^\infty[0,1]$ be the space of measurable and a.e. bounded functions,
which becomes a Banach space with the supremum norm, $\|\cdot\|_\infty$, see point 4
of Section E.1.


Theorem E.5.3 Let $F$ be a bounded linear functional on $L^1[0,1]$. Then there
is a unique function $g \in L^\infty[0,1]$, such that

$$F(f) = \int_0^1 f(t) g(t)\, dt, \quad \forall f \in L^1[0,1].$$

We also have $\|F\| = \|g\|_\infty$.

A real-valued function $g$ defined on the interval $[a, b]$ has bounded variation
if for any division

$$a = x_0 < x_1 < \dots < x_n = b,$$

the sum

$$\sum_{k=0}^{n-1} |g(x_{k+1}) - g(x_k)|$$

is smaller than a given constant. The supremum of these sums over all
possible divisions of $[a, b]$ is called the total variation of $g$ and is denoted by
$\bigvee_a^b(g)$.


Example E.5.1 An increasing function $g$ on $[a, b]$ is of bounded variation,
with $\bigvee_a^b(g) = g(b) - g(a)$.


The following result is an analog of the Arzela-Ascoli theorem for noncontinuous
functions and can be used for extracting a convergent sequence of
functions from a given set.


Theorem E.5.4 (Helly) Let $\mathcal{K}$ be an infinite set of functions from $[a, b]$ to
$\mathbb{R}$ such that:

(i) $\mathcal{K}$ is uniformly bounded, i.e., $\exists C > 0$ such that $\sup_{x \in [a,b]} |f(x)| \le C$, for all
$f \in \mathcal{K}$;

(ii) $\exists V > 0$ such that $\bigvee_a^b(f) \le V$, for all $f \in \mathcal{K}$.

Then we can choose a sequence $(f_n)_n$ of functions in $\mathcal{K}$ that is convergent at
each point $x \in [a, b]$.


Recall that $C[0,1]$ is the space of real-valued continuous functions defined
on $[0,1]$, and it becomes a Banach space with respect to the norm $\|\cdot\|_\infty$.
The simplest continuous linear functional on $C([0,1])$ is $F(f) = f(t_0)$, which
assigns to $f$ its value at a fixed point $t_0$. The next result states that the general
form of this type of functional is obtained by a Stieltjes combination of
the aforementioned type of particular functionals.


Theorem E.5.5 Let $F$ be a continuous linear functional on $C([0,1])$. There
is a function $g : [0,1] \to \mathbb{R}$ with bounded variation, such that

$$F(f) = \int_0^1 f(t)\, dg(t), \quad \forall f \in C([0,1]).$$

We also have $\|F\| = \bigvee_0^1(g)$.


We recall that the Stieltjes integral $\int_0^1 f(t)\, dg(t)$ is defined as the limit
of the Riemann-type sums

$$\sum_{k=0}^{m-1} f(x_k) [g(x_{k+1}) - g(x_k)]$$

as the norm of the division $0 = x_0 < x_1 < \dots < x_m = 1$ tends to zero.
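The Riemann-type sums that define the Stieltjes integral are straightforward to compute. The sketch below uses a uniform division of the unit interval, which is an assumption of this example (the definition allows any division whose norm tends to zero), and tries two integrators: a smooth one, g(t) = t^2, and a unit step at t_0 = 0.5, which recovers the evaluation functional F(f) = f(t_0).

```python
def stieltjes_sum(f, g, m):
    """Riemann-Stieltjes sum of f with respect to g on a uniform division of [0, 1]."""
    xs = [k / m for k in range(m + 1)]
    return sum(f(xs[k]) * (g(xs[k + 1]) - g(xs[k])) for k in range(m))


def identity(t):
    return t


def square(t):
    return t * t


def step_at_half(t):
    """Unit step integrator with a jump at t0 = 0.5."""
    return 1.0 if t >= 0.5 else 0.0


# Integral of t d(t^2) equals the integral of 2 t^2 dt = 2/3.
print(stieltjes_sum(identity, square, 1000))

# With a unit step at t0 the sums approach f(t0) = 0.5.
print(stieltjes_sum(identity, step_at_half, 1000))
```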

The next result is a generalization of the previous one. Let $K$ denote a
compact set in $\mathbb{R}^n$, and denote by $C(K)$ the set of real-valued continuous
functions on $K$.

Theorem E.5.6 Let $F$ be a bounded linear functional on $C(K)$. Then there exists
a unique finite signed Borel measure $\mu$ on $K$, such that

$$F(f) = \int_K f(x)\, d\mu(x), \quad \forall f \in C(K).$$

Moreover, $\|F\| = |\mu|(K)$.

The next result replaces the boundedness condition of the functional by
positivity. In this case the signed measure becomes a measure.

Theorem E.5.7 Let $L$ be a positive linear functional on $C(K)$. Then there exists
a unique finite Borel measure $\mu$ on $K$, such that

$$L(f) = \int_K f(x)\, d\mu(x), \quad \forall f \in C(K).$$

E.6 Fixed Point Theorem 

Let (M, d) be a metric space and T : M -E M be a mapping of M into itself. 
T is called a contraction if 

d(T(x), T(x r )) < A d(x,x'), Vx, x E M 

for some positive constant A less than 1. If the metric is induced by a norm 
on M, i.e., if d(x, x') — \\x — x'\\, the contraction condition can be written as 
|| T(x) — T(x ')|| < X\\x — x'\\. 

A sequence $(x_n)$ of points in a metric space $(M, d)$ is called a Cauchy
sequence if for any $\epsilon > 0$, there is an $N \ge 1$ such that $d(x_n, x_m) < \epsilon$, for all
$n, m > N$.

The metric space $(M, d)$ is called complete if any Cauchy sequence $(x_n)$ in
$M$ is convergent, i.e., there is $x^* \in M$ such that for all $\epsilon > 0$, there is $N \ge 1$
such that $d(x_n, x^*) < \epsilon$, for all $n > N$.

Example E.6.1 The space $\mathbb{R}^n$ with the Euclidean distance forms a complete
metric space.

Example E.6.2 The space of linear operators $\{L;\ L : \mathbb{R}^n \to \mathbb{R}^n\}$ endowed
with the metric induced by the norm $\|L\| = \sup_{x \ne 0} \frac{\|Lx\|}{\|x\|}$ is a complete metric
space.

Example E.6.3 The space $C[a, b] = \{f : [a, b] \to \mathbb{R};\ f \text{ continuous}\}$ is a complete
metric space with the metric $d(f, g) = \sup_{x \in [a,b]} |f(x) - g(x)|$.






Figure 1: a. A fixed point of a continuous function $f : [0,1] \to [0,1]$. b. A
fixed point of an increasing function $f : [0,1] \to [0,1]$.

A point $x^* \in M$ is called a fixed point for the application $T : M \to M$ if
$T(x^*) = x^*$.

Example E.6.4 Any continuous function $f : [0,1] \to [0,1]$ has at least one
fixed point. This follows geometrically from the fact that any continuous curve
joining two arbitrary points situated on opposite sides of a square intersects
the diagonal $y = x$, see Fig. 1 a. It is worth noting that Knaster proved
that the fixed point property holds also in the case when the function $f$ is
monotonically increasing instead of continuous, see Fig. 1 b.

Theorem E.6.5 A contraction $T$ of the complete metric space $(M, d)$ into
itself has a unique fixed point.

For any point $x_0 \in M$, the sequence $(x_n)$ defined by $x_{n+1} = T(x_n)$ converges
toward the fixed point, $x_n \to x^*$. Moreover, we have the estimation

$$d(x_n, x^*) \le \frac{\lambda^n}{1 - \lambda}\, d(x_0, x_1).$$
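The iteration in the fixed point theorem can be tried out directly. The sketch below is only an illustration under stated assumptions: the map $T(x) = \cos x$ on $M = [0,1]$ is not taken from the text, and the contraction constant $\lambda = \sin 1$ comes from the bound $|T'(x)| = |\sin x| \le \sin 1$ on $[0,1]$. It compares the actual error with the a priori bound $\frac{\lambda^n}{1-\lambda} d(x_0, x_1)$.

```python
import math

def fixed_point_iteration(T, x0, n_steps):
    """Return the iterates x_0, x_1, ..., x_n of x_{k+1} = T(x_k)."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(T(xs[-1]))
    return xs

# Illustrative contraction (an assumption of this sketch): T = cos on [0, 1].
T = math.cos
lam = math.sin(1.0)          # contraction constant, about 0.84 < 1

xs = fixed_point_iteration(T, x0=0.0, n_steps=200)
x_star = xs[-1]              # numerically converged fixed point, cos(x*) = x*
d01 = abs(xs[1] - xs[0])     # d(x_0, x_1)

# A priori estimate d(x_n, x*) <= lam^n / (1 - lam) * d(x_0, x_1).
bound_holds = all(
    abs(xs[n] - x_star) <= lam**n / (1.0 - lam) * d01 + 1e-12
    for n in range(30)
)
```

The bound is usually pessimistic: it only uses the worst-case constant $\lambda$, while the iterates contract at the local rate $|T'(x^*)|$.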








Appendix F 


Real Analysis 


F.1 Inverse Function Theorem

The Inverse Function Theorem is a result that states the local invertibility of
a continuously differentiable function from $\mathbb{R}^n$ to $\mathbb{R}^n$, which satisfies a nonzero
Jacobian condition at a point. We recall that the Jacobian of the function
$F = (F_1, \ldots, F_n)$ is the $n \times n$ matrix of partial derivatives

$$J_F(x) = \left( \frac{\partial F_i(x)}{\partial x_j} \right)_{i,j}.$$

Theorem F.1.1 Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be a continuously differentiable function
and $p \in \mathbb{R}^n$ be a point such that $\det J_F(p) \ne 0$. Then there are two open
sets $U$ and $V$ that contain points $p$ and $q = F(p)$, respectively, such that
$F|_U : U \to V$ is invertible, with the inverse continuously differentiable. The
Jacobian of the inverse is given by $J_{F^{-1}}(q) = [J_F(p)]^{-1}$.

The theorem can be reformulated in mathematical jargon by stating that 
a continuously differentiable function with a nonsingular Jacobian at a point 
is a local diffeomorphism around that point. 

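Another way to formulate the theorem is in terms of nonlinear systems
of equations. Consider the system of $n$ equations with $n$ unknowns

$$\begin{aligned} F_1(x_1, \ldots, x_n) &= y_1 \\ &\;\;\vdots \\ F_n(x_1, \ldots, x_n) &= y_n \end{aligned}$$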








and assume there is a point $x^0 \in \mathbb{R}^n$ such that $\det\left(\frac{\partial F_i}{\partial x_j}\right)(x^0) \ne 0$. Then there
exist two open sets $U$ and $V$ about $x^0$ and $y^0 = F(x^0)$ such that the system
can be solved with unique solution as long as $x \in U$ and $y \in V$. This means
there are $n$ continuously differentiable functions $G_i : V \to \mathbb{R}$ such that

$$\begin{aligned} x_1 &= G_1(y_1, \ldots, y_n) \\ &\;\;\vdots \\ x_n &= G_n(y_1, \ldots, y_n) \end{aligned}$$

for any $y = (y_1, \ldots, y_n) \in V$.

We note that this is an existential result, which does not construct explicitly
the solution of the system. However, in the particular case when the function
is linear, i.e., $F(x) = Ax$, with $A$ a nonsingular square matrix, the linear
system $Ax = y$ has a unique solution $x = A^{-1}y$. The solution in this case is
global, since the Jacobian $J_F(x) = A$ has a nonzero determinant everywhere.

F.2 Differentiation in generalized sense 

How do we differentiate a function that is not differentiable everywhere? If
we have a piecewise differentiable function, differentiating it piecewise does
not always provide the right result since we can't find the derivative at the
contact points. The “derivative” of a non-differentiable function might exist
sometimes in the so-called generalized sense. Let $f : \mathbb{R} \to \mathbb{R}$ be a function.
We say that $g$ is the derivative of the function $f$ in the generalized sense if

$$\int_{\mathbb{R}} g(x)\varphi(x)\,dx = -\int_{\mathbb{R}} f(x)\varphi'(x)\,dx, \qquad (F.2.1)$$

for any compactly supported smooth function $\varphi$. We note that the generalized
differentiation is an extension of classical differentiation, since for a
differentiable function $f$ the previous relation becomes the familiar integration
by parts formula

$$\int_{\mathbb{R}} f'(x)\varphi(x)\,dx = -\int_{\mathbb{R}} f(x)\varphi'(x)\,dx.$$

Example F.2.1 Let $f(x) = H(x)$ be the Heaviside step function. Then its
derivative is given by Dirac's function, $f'(x) = \delta(x)$. We shall check it next
by computing both sides of relation (F.2.1):

$$\int_{\mathbb{R}} \delta(x)\varphi(x)\,dx = \int_{\mathbb{R}} \varphi(x)\,\delta(dx) = \varphi(0)$$

$$-\int_{\mathbb{R}} f(x)\varphi'(x)\,dx = -\int_{\mathbb{R}} H(x)\varphi'(x)\,dx = -\int_0^\infty \varphi'(x)\,dx = \varphi(0) - \varphi(\infty) = \varphi(0).$$

Similarly, the derivative of $H(x - a)$ is $\delta_a(x)$, where $\delta_a(x) = \delta(x - a)$.





Example F.2.2 The derivative of $ReLU(x)$ is the Heaviside function,
$ReLU'(x) = H(x)$. We can check it using relation (F.2.1):

$$\int_{\mathbb{R}} ReLU'(x)\varphi(x)\,dx = -\int_{\mathbb{R}} ReLU(x)\varphi'(x)\,dx = -\int_0^\infty x\varphi'(x)\,dx$$

$$= -x\varphi(x)\Big|_0^\infty + \int_0^\infty \varphi(x)\,dx = \int_0^\infty \varphi(x)\,dx = \int_{\mathbb{R}} H(x)\varphi(x)\,dx,$$

for any $\varphi \in C_0^\infty(\mathbb{R})$. Hence, $ReLU'(x) = H(x)$, in the generalized sense.
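Relation (F.2.1) for the pair $f = ReLU$, $g = H$ can also be checked numerically. The sketch below is an illustration under assumptions that are not in the text: the test function is a shifted bump $\varphi(x) = \exp(-1/(1-(x-c)^2))$ supported on $[c-1, c+1]$ with $c = 0.3$, and both integrals are approximated by a plain midpoint rule.

```python
import math

def bump(x, center=0.3):
    """Smooth test function with compact support [center - 1, center + 1]."""
    u = x - center
    if abs(u) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - u * u))

def bump_prime(x, center=0.3):
    """Classical derivative of bump, obtained by the chain rule."""
    u = x - center
    if abs(u) >= 1.0:
        return 0.0
    return bump(x, center) * (-2.0 * u) / (1.0 - u * u) ** 2

def midpoint(h, a, b, n=20000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    step = (b - a) / n
    return step * sum(h(a + (k + 0.5) * step) for k in range(n))

relu = lambda x: max(x, 0.0)
heaviside = lambda x: 1.0 if x > 0 else 0.0

a, b = -0.7, 1.3   # support of the test function
lhs = midpoint(lambda x: heaviside(x) * bump(x), a, b)     # int H(x) phi(x) dx
rhs = -midpoint(lambda x: relu(x) * bump_prime(x), a, b)   # -int ReLU(x) phi'(x) dx
```

Up to quadrature error the two numbers `lhs` and `rhs` agree, which is what (F.2.1) asserts for this particular $\varphi$; a proof of course requires all test functions.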


F.3 Convergence of sequences of functions 

Let $f_n : \mathbb{R} \to \mathbb{R}$ be a sequence of functions. Then $f_n$ can approximate a
function $f$ in several ways.

1. The sequence of functions $(f_n)_n$ is pointwise convergent to $f$ if $f_n(x)$
converges to $f(x)$ for any $x \in \mathbb{R}$.

2. Let $f \in L^2(\mathbb{R})$. The sequence of functions $(f_n)_n$ is $L^2$-convergent to $f$
if $\|f_n - f\|_2 \to 0$, as $n \to \infty$. This means

$$\lim_{n \to \infty} \int_{\mathbb{R}} |f_n(x) - f(x)|^2\,dx = 0.$$

3. The sequence $(f_n)_n$ is weakly convergent to $f$ if

$$\lim_{n \to \infty} \int_{\mathbb{R}} f_n(x)\varphi(x)\,dx = \int_{\mathbb{R}} f(x)\varphi(x)\,dx, \qquad \forall \varphi \in C_0^\infty(\mathbb{R}).$$


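It is worth noting that the $L^2$-convergence implies the weak convergence
and, along a subsequence, the pointwise convergence almost everywhere. The
former follows from an application of Cauchy's inequality

$$\left| \int_{\mathbb{R}} (f_n - f)\varphi \right| \le \int_{\mathbb{R}} |f_n - f|\,|\varphi| \le \|f_n - f\|_2\, \|\varphi\|_2$$

and the Squeeze Theorem.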








Appendix G 

Linear Algebra 


G.1 Eigenvalues, Norm, and Inverse Matrix

Consider a matrix with $n$ rows and $m$ columns, $A \in \mathbb{R}^{n \times m}$, given by $A = (a_{ij})$.
The transpose of the matrix $A$, denoted by $A^T$, is given by $A^T = (a_{ji})$
and satisfies $(A^T)^T = A$, $(AB)^T = B^T A^T$. A matrix is called symmetric if
$A = A^T$. A necessary condition for symmetry is that $A$ has to be a square
matrix, i.e., $n = m$. If $AA^T = I$ (the identity matrix), then the matrix is
called orthogonal.

A number $\lambda$ (real or complex) is called an eigenvalue of the square matrix
$A$ if there is a nonzero vector $x$ such that $Ax = \lambda x$. The vector $x$ is called
an eigenvector of the matrix $A$. The eigenvalues are the solutions of the
polynomial equation $\det(A - \lambda I) = 0$.

Proposition G.1.1 Let $A$ be a symmetric matrix. Then $A$ has $n$ real eigenvalues,
not necessarily distinct, $\lambda_1, \ldots, \lambda_n$, and $n$ eigenvectors, $x_1, \ldots, x_n$,
which form an orthonormal basis in $\mathbb{R}^n$.


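A matrix of type $n \times 1$ is called a vector. If $w$ and $b$ are two vectors, then
$w^T b = b^T w = \langle w, b \rangle$, where $\langle \cdot\,, \cdot \rangle$ is the Euclidean scalar product on $\mathbb{R}^n$.

One can define several norms on $\mathbb{R}^n$. Let $x^T = (x_1, \ldots, x_n)$ be a vector.
Then

$$\|x\|_1 = \sum_{i=1}^n |x_i|, \qquad \|x\|_2 = \Big( \sum_{i=1}^n x_i^2 \Big)^{1/2}, \qquad \|x\|_\infty = \max_{1 \le i \le n} |x_i|$$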







are three norms that are used in this book. It can be shown that there are
two constants $C_1, C_2 > 0$ such that

$$C_1 \|x\|_1 \le \|x\|_2 \le C_2 \|x\|_\infty, \qquad \forall x \in \mathbb{R}^n.$$

Geometrically, this means that any $\|\cdot\|_2$-ball can be simultaneously included
in a $\|\cdot\|_\infty$-ball and also includes a $\|\cdot\|_1$-ball.
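The three vector norms and the two-sided comparison can be checked on a concrete vector. In the minimal sketch below, the sample vector and the constants $C_1 = 1/\sqrt{n}$, $C_2 = \sqrt{n}$ are assumptions of the example: they are one admissible pair (obtained from the Cauchy-Schwarz inequality), not necessarily the pair the text has in mind.

```python
import math

def norm1(x):
    """Sum of absolute values."""
    return sum(abs(xi) for xi in x)

def norm2(x):
    """Euclidean norm."""
    return math.sqrt(sum(xi * xi for xi in x))

def norm_inf(x):
    """Largest absolute coordinate."""
    return max(abs(xi) for xi in x)

x = [3.0, -4.0, 12.0]          # illustrative vector
n = len(x)

# One admissible pair of constants (assumption of this sketch).
C1 = 1.0 / math.sqrt(n)        # C1 * ||x||_1 <= ||x||_2
C2 = math.sqrt(n)              # ||x||_2 <= C2 * ||x||_inf

comparison_holds = C1 * norm1(x) <= norm2(x) <= C2 * norm_inf(x)
```

For this vector the norms are 19, 13, and 12, respectively.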


Each of the previous norms induces a norm for the square matrix $A$.
Inspired by the norm of a linear operator, we define

$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} \qquad (G.1.1)$$

where $\|\cdot\|$ is any of the aforementioned norms on $\mathbb{R}^n$. We note the inequalities
$\|Ax\| \le \|A\|\,\|x\|$ and $\|AB\| \le \|A\|\,\|B\|$, for any other square matrix $B$. The
norms induced by the previous three norms are

$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^n |a_{ij}|, \qquad \|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^n |a_{ij}|, \qquad \|A\|_2 = \sqrt{\rho(A^T A)},$$


where $\rho$ denotes the spectral radius of a matrix, given by $\rho(A) = \max_j |\lambda_j|$,
where $\lambda_j$ represent the eigenvalues of $A$. The matrix $B = A^T A$, being symmetric,
has real eigenvalues by Proposition G.1.1, and these are nonnegative since
$x^T A^T A x = \|Ax\|_2^2 \ge 0$. If $A$ is itself symmetric, then $A^T A = A^2$ has the
eigenvalues $\lambda_j^2$. Therefore, in this case $\rho(A^T A) = \max_j \lambda_j^2$, and then
$\|A\|_2 = \max_j |\lambda_j|$ is the largest absolute value of eigenvalues of matrix $A$.

It is worth noting that if $A$ is symmetric, then $\|A\|_2 = \rho(A)$. Also, in this case $\|A\|_2$ is
the smallest among all norms generated by formula (G.1.1), i.e., $\|A\|_2 \le \|A\|$.
This can be shown as follows. Let $\lambda$ be the eigenvalue with the
largest absolute value and $x$ an eigenvector of length one. Then

$$\|A\|_2 = |\lambda| = |\lambda|\,\|x\| = \|\lambda x\| = \|Ax\| \le \|A\|\,\|x\| = \|A\|.$$

We also have $\|A\|_2 \ge \frac{1}{n}|\mathrm{Tr}(A)|$ and $\|A\|_2 \ge |\det A|^{1/n}$.
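The three induced matrix norms can be computed for a small example. The sketch below is a hedged illustration: the matrix is an arbitrary choice, and the spectral radius of $A^T A$ is estimated by power iteration, a standard technique that is not discussed in the text (it assumes $A \ne 0$ and a starting vector not orthogonal to the dominant eigenvector).

```python
import math

def induced_norm_1(A):
    """Maximum absolute column sum."""
    n = len(A)
    return max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))

def induced_norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def induced_norm_2(A, iterations=500):
    """sqrt(rho(A^T A)), with rho estimated by power iteration on B = A^T A."""
    n = len(A)
    B = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        scale = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / scale for wi in w]
    Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    rho = sum(v[i] * Bv[i] for i in range(n))   # Rayleigh quotient, ||v||_2 = 1
    return math.sqrt(rho)

A = [[1.0, 2.0], [3.0, 4.0]]    # illustrative matrix
n1, ninf, n2 = induced_norm_1(A), induced_norm_inf(A), induced_norm_2(A)
```

Here $A^T A$ has the eigenvalues $15 \pm \sqrt{221}$, so the value returned for $\|A\|_2$ can be compared with $\sqrt{15 + \sqrt{221}}$. Since this $A$ is not symmetric, $\|A\|_2$ differs from the spectral radius $\rho(A)$.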

Proposition G.1.2 Let $A$ be a square matrix.

(i) The power matrix $A^m$ converges to the zero matrix as $m \to \infty$ if and only
if $\rho(A) < 1$.

(ii) If $\rho(A) < 1$ then $I - A$ is invertible and we have

$$(I - A)^{-1} = I + A + A^2 + \cdots + A^m + \cdots \qquad (G.1.2)$$

(iii) If the geometric series (G.1.2) converges, then $\rho(A) < 1$.

Corollary G.1.3 Let $A$ be a square matrix and $\|\cdot\|$ be a norm. If $\|A\| < 1$,
then $I - A$ is invertible with the inverse given by (G.1.2) and

$$\|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|}.$$
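The geometric series (G.1.2) can be summed numerically when $\|A\| < 1$. The following minimal sketch uses an arbitrary $2 \times 2$ matrix with $\|A\|_\infty = 0.7$ (an assumption of the example) and checks two things: that the partial sum inverts $I - A$, and that it satisfies the standard bound $\|(I - A)^{-1}\| \le 1/(1 - \|A\|)$ valid for induced norms.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def norm_inf(X):
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in X)

def neumann_inverse(A, terms):
    """Partial sum I + A + A^2 + ... + A^(terms-1) of the series (G.1.2)."""
    n = len(A)
    total = identity(n)
    power = identity(n)
    for _ in range(terms - 1):
        power = matmul(power, A)
        total = matadd(total, power)
    return total

A = [[0.2, 0.1], [0.3, 0.4]]               # illustrative matrix, ||A||_inf = 0.7
S = neumann_inverse(A, terms=200)

I_minus_A = [[1.0 - 0.2, -0.1], [-0.3, 1.0 - 0.4]]
product = matmul(I_minus_A, S)             # should be close to the identity
```

The truncation error after $m$ terms is of order $\|A\|^m$, so the smaller the norm, the fewer terms are needed.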




















































In the following we shall deal with a procedure of finding the inverse of a
sum of two square matrices. We shall explain first the idea in the 1-dimensional
case. Assume $a_1, a_2 \in \mathbb{R} \setminus \{0\}$ are two real numbers with $a_1 + a_2 \ne 0$. A simple
algebraic computation yields

$$\frac{1}{a_1 + a_2} = \frac{1}{a_2} - \frac{1}{a_1 + a_2}\,\frac{a_1}{a_2}, \qquad \frac{1}{a_1 + a_2} = \frac{1}{a_1} - \frac{1}{a_1 + a_2}\,\frac{a_2}{a_1}. \qquad (G.1.3)$$

We construct two linear functions $f, g : \mathbb{R} \to \mathbb{R}$ given by $f(x) = -\frac{a_1}{a_2}x + \frac{1}{a_2}$
and $g(x) = -\frac{a_2}{a_1}x + \frac{1}{a_1}$, and consider two cases:


1. If $|a_1| < \lambda |a_2|$, with $0 < \lambda < 1$, then

$$|f(x) - f(x')| = \left| \frac{a_1}{a_2} \right| |x - x'| \le \lambda |x - x'|.$$

Therefore, $f$ is a contraction of the complete metric space $(\mathbb{R}, |\cdot|)$ into itself,
and hence it has a unique fixed point, see Theorem E.6.5. The fixed point, $x^*$,
satisfies $f(x^*) = x^*$ and is given by $x^* = \frac{1}{a_1 + a_2}$. Its approximation sequence
is $(x_n)$, defined by the recurrence $x_{n+1} = f(x_n)$, and $x_0 = 0$. The error
estimation is given by $|x_n - x^*| \le \frac{\lambda^n}{1 - \lambda} |x_1 - x_0| = \frac{\lambda^n}{(1 - \lambda)|a_2|}$.

2. If $|a_2| < \lambda |a_1|$, with $0 < \lambda < 1$, then $|g(x) - g(x')| \le \lambda |x - x'|$, so $g$ is a
contraction of $\mathbb{R}$, and hence it has a unique fixed point, which is $\frac{1}{a_1 + a_2}$. The
details are similar to the first case.
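The first case can be tried on numbers. The sketch below iterates $x_{n+1} = f(x_n) = \frac{1}{a_2} - \frac{a_1}{a_2} x_n$ from $x_0 = 0$ and compares the error with the estimate $\frac{\lambda^n}{(1-\lambda)|a_2|}$. The values $a_1 = 1$, $a_2 = 4$, and $\lambda = 0.5$ are assumptions chosen only so that $|a_1| < \lambda |a_2|$ holds.

```python
def reciprocal_of_sum(a1, a2, n_steps):
    """Iterate x_{k+1} = 1/a2 - (a1/a2) x_k from x_0 = 0 (case |a1| < |a2|)."""
    x = 0.0
    for _ in range(n_steps):
        x = 1.0 / a2 - (a1 / a2) * x
    return x

a1, a2 = 1.0, 4.0                  # illustrative values, |a1| < lam * |a2|
lam = 0.5
n = 10

x_n = reciprocal_of_sum(a1, a2, n)
x_star = 1.0 / (a1 + a2)           # exact fixed point, 1/(a1 + a2) = 0.2
error = abs(x_n - x_star)
error_bound = lam**n / ((1.0 - lam) * abs(a2))
```

Only multiplications by $1/a_2$ are used, so the reciprocal of the sum is obtained from the reciprocal of one summand, which is the point of the construction.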


Consider now two invertible $n \times n$ matrices, $A_1$, $A_2$. We claim that

$$(A_1 + A_2)^{-1} = A_2^{-1} - (A_1 + A_2)^{-1} A_1 A_2^{-1} \qquad (G.1.4)$$

which is the analog relation of (G.1.3) for matrices. This can be shown by a
mere multiplication on both sides by $(A_1 + A_2)$ as follows:

$$I = (A_1 + A_2) A_2^{-1} - A_1 A_2^{-1}$$

$$I = A_1 A_2^{-1} + I - A_1 A_2^{-1}.$$

Assuming $\|A_1 A_2^{-1}\| < 1$, it follows that $I + A_1 A_2^{-1}$ is invertible, see Corollary
G.1.3. Then solving for $(A_1 + A_2)^{-1}$ from relation (G.1.4) yields

$$(A_1 + A_2)^{-1} = A_2^{-1}(I + A_1 A_2^{-1})^{-1}. \qquad (G.1.5)$$


This closed-form formula for the inverse of a sum of two matrices can’t be used 
in practice. For computational reasons we consider the following two ways: 








































1. Consider the map $f : \mathcal{M}_{n \times n} \to \mathcal{M}_{n \times n}$, $f(M) = A_2^{-1} - M A_1 A_2^{-1}$. Since

$$\|f(M) - f(M')\| = \|(M' - M) A_1 A_2^{-1}\| \le \|M' - M\|\,\|A_1 A_2^{-1}\| \le \lambda \|M - M'\|,$$

where $\|A_1 A_2^{-1}\| < \lambda < 1$, then $f$ is a contraction of $\mathcal{M}_{n \times n}$ into itself. The space $\mathcal{M}_{n \times n}$ is complete,
since any matrix is associated with a linear operator, and the space of linear
operators on $\mathbb{R}^n$ is complete. By the fixed point theorem, the mapping $f$ has
a unique fixed point, $M^*$, i.e., $f(M^*) = M^*$. From (G.1.4) it follows that
$M^* = (A_1 + A_2)^{-1}$. This inverse can be approximated by the sequence of
matrices $(M_n)$, given by $M_{n+1} = f(M_n)$, $M_0 = O$. The error is estimated by

$$\|M_n - M^*\| \le \frac{\lambda^n}{1 - \lambda} \|M_1 - M_0\| = \frac{\lambda^n}{1 - \lambda} \|A_2^{-1}\|.$$



2. Another way to approximate $(A_1 + A_2)^{-1}$ is to expand (G.1.5) in a series,
see Proposition G.1.2:

$$(A_1 + A_2)^{-1} = A_2^{-1}(I + A_1 A_2^{-1})^{-1} = A_2^{-1} \sum_{k \ge 0} (-1)^k (A_1 A_2^{-1})^k.$$

The previous computation was conducted under the condition $\|A_1 A_2^{-1}\| < 1$.
This implies $\rho(A_1 A_2^{-1}) < 1$, which, when $A_1$ and $A_2$ are simultaneously
diagonalizable, reads $|\lambda_i(A_1)| < |\lambda_i(A_2)|$ for all $i \in \{1, \ldots, n\}$, and hence
$\rho(A_1) < \rho(A_2)$, i.e., the matrix $A_1$ has smaller eigenvalues than $A_2$, respectively.

It is worth noting that due to symmetry reasons, the roles of $A_1$ and
$A_2$ can be inverted, and similar formulas can be obtained if $A_1$ is assumed
invertible.
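The first of the two ways above, the fixed point iteration $M_{k+1} = A_2^{-1} - M_k A_1 A_2^{-1}$ with $M_0 = O$, can be sketched for $2 \times 2$ matrices as follows. The matrices $A_1$, $A_2$ are assumptions of the example, chosen so that $\|A_1 A_2^{-1}\|_\infty = 0.325 < 1$; the helper names are hypothetical.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matsub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def inverse_2x2(X):
    """Explicit inverse of an invertible 2 x 2 matrix."""
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def inverse_of_sum(A1, A2_inv, n_steps):
    """Iterate M_{k+1} = A2^{-1} - M_k A1 A2^{-1}, starting from M_0 = O."""
    n = len(A1)
    K = matmul(A1, A2_inv)                     # A1 A2^{-1}
    M = [[0.0] * n for _ in range(n)]
    for _ in range(n_steps):
        M = matsub(A2_inv, matmul(M, K))
    return M

A1 = [[1.0, 0.5], [0.0, 1.0]]                  # illustrative matrices
A2 = [[4.0, 0.0], [1.0, 5.0]]
M = inverse_of_sum(A1, inverse_2x2(A2), n_steps=60)

total = [[A1[i][j] + A2[i][j] for j in range(2)] for i in range(2)]
residual = matmul(M, total)                    # should be close to the identity
```

Each step costs one matrix product once $A_1 A_2^{-1}$ is stored, so the scheme is attractive when $A_2^{-1}$ is already known and $A_1$ is a small perturbation.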

The proof of the following result is just by mere multiplication.

Lemma G.1.4 (Matrix inversion lemma) Let $A, B \in \mathcal{M}_{m \times m}$ be positive
definite matrices, and $C \in \mathcal{M}_{m \times n}$, $D \in \mathcal{M}_{n \times n}$, with $D$ positive definite. If

$$A = B^{-1} + C D^{-1} C^T$$

then

$$A^{-1} = B - BC(D + C^T B C)^{-1} C^T B.$$
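The matrix inversion lemma is easy to test numerically. The sketch below does so for $m = 2$, $n = 1$; the particular matrices $B$, $C$, $D$ and the small Gauss-Jordan helper `inverse` are assumptions of the example, not part of the text.

```python
def matmul(X, Y):
    """Product of (possibly rectangular) matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def add(X, Y, sign=1.0):
    """Entrywise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def inverse(X):
    """Gauss-Jordan elimination with partial pivoting (X assumed invertible)."""
    n = len(X)
    aug = [list(map(float, X[i])) + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

B = [[2.0, 0.5], [0.5, 1.0]]      # m x m, positive definite (illustrative)
C = [[1.0], [2.0]]                # m x n, with m = 2 and n = 1
D = [[3.0]]                       # n x n, positive definite
Ct = transpose(C)

A = add(inverse(B), matmul(matmul(C, inverse(D)), Ct))     # B^{-1} + C D^{-1} C^T
inner = inverse(add(D, matmul(matmul(Ct, B), C)))          # (D + C^T B C)^{-1}
lemma_rhs = add(B, matmul(matmul(matmul(matmul(B, C), inner), Ct), B), sign=-1.0)
direct = inverse(A)
```

The practical interest of the lemma is visible here: the right side only inverts the $n \times n$ matrix $D + C^T B C$, which is $1 \times 1$ in this example, instead of the $m \times m$ matrix $A$.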


G.2 Moore-Penrose Pseudoinverse 

A linear system is called overdetermined if it has more equations than
unknowns. Typically, systems of this type do not have any solutions. The
Moore-Penrose pseudoinverse method provides an approximate solution, [89,
97], which, for all practical purposes, is good in some sense specified at the
end of this section.
























We start by considering a linear system in matrix form, $AX = b$, where
$A$ is an $m \times n$ matrix, with $m > n$ (more rows than columns), $X$ is an $n$-dimensional
unknown vector, and $b$ is an $m$-dimensional given vector. Since
$A$ is not a square matrix, the inverse $A^{-1}$ does not make sense in this case.
However, there are good chances that the square matrix $A^T A$ is invertible.¹
For instance, if $A$ has full rank, $\mathrm{rank}\,A = n$, then $\mathrm{rank}\,A^T A = \mathrm{rank}\,A = n$, so
the $n \times n$ matrix $A^T A$ has maximal rank, and hence $\det A^T A \ne 0$, i.e., $A^T A$
is invertible.

Then multiplying the equation by the transpose matrix, $A^T$, to the left
we obtain $A^T A X = A^T b$. Assuming that $A^T A$ is invertible, we obtain the
solution $X = (A^T A)^{-1} A^T b$. The pseudoinverse of $A$ is defined by the $n \times m$
matrix

$$A^+ = (A^T A)^{-1} A^T. \qquad (G.2.6)$$

In this case the Moore-Penrose pseudoinverse solution of the overdetermined
system $AX = b$ is given by $X = A^+ b$.

In the case when $A$ is invertible we have $A^+ = A^{-1}$, i.e., the pseudoinverse
is a generalization of the inverse of a matrix.

It is worth noting that if the matrix $A$ has more columns than rows, i.e.,
$n > m$, then $A^T A$ does not have an inverse, since $\det A^T A = 0$. This follows
from the rank evaluation of the $n$-dimensional matrix $A^T A$

$$\mathrm{rank}\,A^T A = \mathrm{rank}\,A \le \min\{n, m\} = m < n.$$

In this case the pseudoinverse $A^+$, even if it always exists, cannot be expressed
by the explicit formula (G.2.6).

Geometric Significance Consider the linear mapping $F : \mathbb{R}^n \to \mathbb{R}^m$,
$F(X) = AX$, with $n < m$. The matrix $A$ is assumed to have full rank,
$\mathrm{rank}\,A = n$. In this case the range of $F$ is the following linear subspace of
$\mathbb{R}^m$:

$$\mathcal{R} = \{AX;\ X \in \mathbb{R}^n\},$$

of dimension $\dim \mathcal{R} = \mathrm{rank}\,A = n$.

Now, given a vector $b \in \mathbb{R}^m$, not necessarily contained in the space $\mathcal{R}$, we
try to approximately solve the linear system $AX = b$, using the minimum
norm solution. This is a vector $X^* \in \mathbb{R}^n$, which minimizes the $L^2$-norm of
the difference $AX - b$, i.e.,

$$X^* = \arg\min_{X \in \mathbb{R}^n} \|AX - b\|_2. \qquad (G.2.7)$$


¹ This follows from the fact that the matrices $A$ satisfying the algebraic equation
$\det A^T A = 0$ form a negligible set in the set of $m \times n$ matrices.








Figure 1: The geometric interpretation of the pseudoinverse solution $X^* = A^+ b$.

Geometrically, this means that $AX^*$ is the point in the space $\mathcal{R}$, which is
closest to $b$, i.e., it is the orthogonal projection of $b$ onto $\mathcal{R}$, see Fig. 1.

Let $\hat{b}$ denote the orthogonal projection of $b$ onto the space $\mathcal{R}$, and consider
the linear system $AX = \hat{b}$. Since $\hat{b} \in \mathcal{R}$, the system does have solutions. The
uniqueness follows from the maximum rank condition of matrix $A$. Therefore,
there is a unique vector $X^* \in \mathbb{R}^n$ such that $AX^* = \hat{b}$. This is the solution
claimed by the equation (G.2.7). Equivalently, this means

$$\|AX - b\|_2 \ge \|AX^* - b\|_2 = \|\hat{b} - b\|_2, \qquad \forall X \in \mathbb{R}^n.$$


Next we focus on the expression of the projection $\hat{b}$. It can be shown that
$\hat{b} = AA^+ b$. This follows from the fact that the linear operator $P : \mathbb{R}^m \to \mathbb{R}^m$
given by $P = AA^+ = A(A^T A)^{-1} A^T$ is the orthogonal projector of $\mathbb{R}^m$ onto
the subspace $\mathcal{R}$. This result is implied by the following three properties,
which can be checked by a straightforward computation: $P^2 = P$, $P^T = P$,
and $PA = A$; the first means that $P$ is a projector, the second means that $P$
is an orthogonal projector, and the last means that the space $\mathcal{R}$ is invariant
by $P$. Since $X^* = A^+ b$ verifies the equation $AX = \hat{b}$, namely,

$$AX^* = A(A^+ b) = (AA^+) b = Pb = \hat{b},$$

then $X^*$ represents the pseudoinverse solution of the system $AX = b$.
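The three projector properties can be verified on a small full-rank example. In the sketch below the $3 \times 2$ matrix $A$ and the vector $b$ are assumptions of the example; the pseudoinverse is formed from the explicit formula $(A^T A)^{-1} A^T$, which is adequate for illustration but is not how numerical libraries usually compute it.

```python
def matmul(X, Y):
    """Product of (possibly rectangular) matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def inverse_2x2(X):
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

A = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]            # m = 3, n = 2, rank 2
At = transpose(A)
A_plus = matmul(inverse_2x2(matmul(At, A)), At)     # (A^T A)^{-1} A^T
P = matmul(A, A_plus)                               # projector onto the range of A

b = [[1.0], [2.0], [2.0]]                           # illustrative right-hand side
X_star = matmul(A_plus, b)                          # pseudoinverse solution
b_hat = matmul(P, b)                                # projection of b onto the range
residual = [[bi[0] - bh[0]] for bi, bh in zip(b, b_hat)]
orthogonality = matmul(At, residual)                # A^T (b - b_hat), expected to vanish
```

The last quantity expresses that the residual $b - \hat{b}$ is orthogonal to every column of $A$, which is the geometric content of the normal equations $A^T A X = A^T b$.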

One first application of the Moore-Penrose pseudoinverse is finding the
line of best fit for $m$ given points in the plane of coordinates $(x_1, y_1)$,
$(x_2, y_2)$, ..., $(x_m, y_m)$. If the line has the equation $y = ax + b$, we shall















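write the following overdetermined system of $m$ equations:

$$\begin{aligned} a x_1 + b &= y_1 \\ &\;\;\vdots \\ a x_m + b &= y_m \end{aligned}$$

which can be written in the equivalent matrix form

$$\underbrace{\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_m & 1 \end{pmatrix}}_{A} \begin{pmatrix} a \\ b \end{pmatrix} = \underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}}_{Y}.$$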


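We note that in this case $n = 2$, because there are only two parameters to
determine. A straightforward computation shows

$$A^T A = \begin{pmatrix} \|x\|^2 & \sum x_i \\ \sum x_i & m \end{pmatrix}, \qquad (A^T A)^{-1} = \frac{1}{m\|x\|^2 - (\sum x_i)^2} \begin{pmatrix} m & -\sum x_i \\ -\sum x_i & \|x\|^2 \end{pmatrix},$$

where $\|x\|^2 = \sum x_i^2$ and all sums run over $i = 1, \ldots, m$. Since

$$A^T Y = \begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix},$$

the pseudoinverse solution is

$$\begin{pmatrix} a \\ b \end{pmatrix} = A^+ Y = (A^T A)^{-1} A^T Y = \frac{1}{m\|x\|^2 - (\sum x_i)^2} \begin{pmatrix} m \sum x_i y_i - \sum x_i \sum y_i \\ \|x\|^2 \sum y_i - \sum x_i \sum x_i y_i \end{pmatrix}.$$

This provides the familiar expressions for the regression line coefficients

$$a = \frac{m \sum x_i y_i - \sum x_i \sum y_i}{m \sum x_i^2 - (\sum x_i)^2}, \qquad b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{m \sum x_i^2 - (\sum x_i)^2}.$$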


It is worth noting that a similar approach can be applied to polynomial 
regression. 
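The closed-form regression coefficients translate directly into code. The following minimal sketch writes $m$ for the number of data points, so that the lower-right entry of $A^T A$ is $m$; the function name `best_fit_line` and the two small data sets are assumptions of the example.

```python
def best_fit_line(xs, ys):
    """Slope a and intercept b of the least-squares line y = a x + b."""
    m = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx            # determinant of A^T A
    a = (m * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

# Points lying exactly on y = 2x + 1: the fit should recover the line.
a_exact, b_exact = best_fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Points that are not collinear: the fit is a genuine approximation.
a_fit, b_fit = best_fit_line([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
```

The determinant $m \sum x_i^2 - (\sum x_i)^2$ vanishes exactly when all $x_i$ coincide, in which case $A$ does not have full rank and formula (G.2.6) does not apply.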


Proposition G.2.1 Let $A$ be an $m \times n$ matrix of rank $n$. Then

(i) $A^T A$ is positive definite and invertible;

(ii) $\lim_{t \to \infty} e^{-A^T A t} = O_n$.






Proof: (i) Since $A$ has rank $n$, we have $Ax \ne 0$ for any nonzero $x \in \mathbb{R}^n$, and hence

$$\langle A^T A x, x \rangle = x^T A^T A x = \|Ax\|^2 > 0,$$

so the matrix $A^T A$ is positive definite. Using the properties of matrix rank,
$\mathrm{rank}(A^T A) = \mathrm{rank}(A) = n$, so the matrix $A^T A$ has maximum rank and
hence it is invertible.

(ii) Part (i) can be formulated by stating that the matrix $A^T A$ has positive,
nonzero eigenvalues, $\alpha_j > 0$, $1 \le j \le n$. Let $M$ be an invertible $n \times n$ matrix
which diagonalizes $A^T A$, namely, $A^T A = M\,\mathrm{Diag}(\alpha_j)\,M^{-1}$. Then
$(A^T A)^k = M\,\mathrm{Diag}(\alpha_j^k)\,M^{-1}$, and hence

$$e^{-A^T A t} = \sum_{k \ge 0} (-1)^k (A^T A)^k \frac{t^k}{k!} = M\,\mathrm{Diag}\Big( \sum_{k \ge 0} (-1)^k \alpha_j^k \frac{t^k}{k!} \Big) M^{-1} = M\,\mathrm{Diag}(e^{-\alpha_j t})\,M^{-1}.$$

Using $\lim_{t \to \infty} e^{-\alpha_j t} = 0$, it follows that $\lim_{t \to \infty} e^{-A^T A t} = O_n$.



Bibliography 


[1] E. Aarts, J. Korst, Simulated Annealing and Boltzmann Machines 
(John Wiley, Chichester, UK, 1989) 

[2] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for
Boltzmann machines. Cogn. Sci. 9, 147-169 (1985)

[3] S. Amari, Theory of adaptive pattern classifiers. IEEE Trans. Comput. 
EC-16(3), 299-307 (1967) 

[4] S. Amari, Differential-Geometrical Methods in Statistics. Lecture Notes 
in Statistics, vol. 28 (Springer, Berlin, 1985) 

[5] S. Amari, Information geometry of the EM and em algorithms for neural
networks. Neural Netw. 8(9), 1379-1408 (1995)

[6] S. Amari, Natural gradient works efficiently in learning. Neural Comput.
10(2), 251-276 (1998)

[7] S. Amari, Information Geometry and Its Applications. Applied Mathematical
Sciences Book, vol. 194, 1st edn. (Springer, New York, 2016)

[8] S. Amari, H. Park, F. Fukumizu, Adaptive method of realizing natural
gradient learning for multilayer perceptrons. Neural Comput. 12, 1399-1409
(2000)

[9] V.I. Arnold, On functions of three variables. Dokl. Akad. Nauk SSSR
114, 953-965 (1957)

[10] R. Arora, A. Basu, P. Mianjy, A. Mukherjee, Understanding deep neural
networks with rectified linear units. ICLR (2018)


© Springer Nature Switzerland AG 2020 

O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3 




[11] R.B. Ash, Information Theory (Dover Publications, New York, 1990) 

[12] A. Barron, Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. Inf. Theory 39, (1993)

[13] G. Bartók, C. Szepesvári, S. Zilles, Models of active learning in
grouped-structured state spaces. Inf. Comput. 208(4), 364-384 (2010)

[14] Y. Bengio, P. Frasconi, P. Simard, The problem of learning long-term
dependencies in recurrent networks, in IEEE International Conference
on Neural Networks, San Francisco (IEEE Press, 1993), pp. 1183-1195

[15] Y. Bengio, P. Frasconi, P. Simard, Learning long-term dependencies
with gradient descent is difficult. IEEE Trans. Neural Netw. (1994)

[16] J. Bergstra, G. Desjardins, P. Lamblin, Y. Bengio, Quadratic polynomials
learn better image features. Technical Report 1337 (Département
d'Informatique et de Recherche Opérationnelle, Université de Montréal,
2009)

[17] H. Bohr, Zur theorie der fastperiodischen funktionen i. Acta Math. 45,
29-127 (1925)

[18] A.E. Bryson, A gradient method for optimizing multi-stage allocation
processes, in Proceedings of the Harvard University Symposium on Digital
Computers and Their Applications, April 1961

[19] P.C. Bush, T.J. Sejnowski, The Cortical Neuron (Oxford University
Press, Oxford, 1995)

[20] O. Calin, An Informal Introduction to Stochastic Calculus with Applications
(World Scientific, Singapore, 2015)

[21] O. Calin, Entropy maximizing curves. Rev. Roum. Math. Pures Appl.
63(2), 91-106 (2018)

[22] O. Calin, C. Udriste, Geometric Modelling in Probability and Statistics
(Springer, New York, 2014)

[23] E. Çinlar, Probability and Stochastics. Graduate Texts in Mathematics,
vol. 261 (Springer, New York, 2011)

[24] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, Y. Bengio, Learning phrase representations using RNN
encoder-decoder for statistical machine translation. IEEE Trans. Neural
Netw. (2014), arXiv:1406.1078




[25] M.A. Cohen, S. Grossberg, Absolute stability of global pattern formation
and parallel memory storage by competitive neural networks.
IEEE Trans. Syst. Man Cybern. SMC-13, 815-826 (1983)

[26] T.S. Cohen, M. Geiger, J. Kohler, M. Welling, Spherical CNNs. ICLR
(2018), https://openreview.net/pdf?id=Hkbd5xZRb

[27] T.S. Cohen, M. Welling, Group equivariant convolutional networks
(2016), https://arxiv.org/abs/1602.07576

[28] J.M. Corcuera, F. Giummole, A characterization of monotone and regular
divergences. Ann. Inst. Stat. Math. 50(3), 433-450 (1998)

[29] R. Courant, D. Hilbert, Methods of Mathematical Physics, 2nd edn.
(Interscience Publishers, New York, 1955)

[30] G. Cybenko, Approximation by superposition of a sigmoidal function.
Math. Control Signals Syst. 2, 303-314 (1989)

[31] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization (2014),
arXiv:1412.6980

[32] R.J. Douglas, C. Koch, K.A. Martin, H.H. Suarez, Recurrent excitation
in neocortical circuits. Science 269(5226), 981-985 (1995).
https://doi.org/10.1126/science.7638624

[33] S.E. Dreyfus, The numerical solutions of variational problems. J. Math.
Anal. Appl. 5, 30-45 (1962)

[34] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online
learning and stochastic optimization. JMLR 12, 2121-2159 (2011)

[35] D. Dufresne, Fitting combinations of exponentials to probability distributions.
Appl. Stoch. Model Bus. Ind. 23(1), (2006).
https://doi.org/10.1002/asmb.635

[36] N. Dunford, J.T. Schwartz, Linear Operators. Pure and Applied Mathematics,
vol. 1 (Interscience Publishers, New York, 1957)

[37] E.B. Dynkin, Markov Processes I, II (Springer, Berlin, 1965)

[38] A. Einstein, Investigations on the Theory of Brownian Movement
(Dover Publications, Mineola, 1956) translated by A.D. Cowper

[39] B.R. Frieden, Science from Fisher Information, 2nd edn. (Cambridge
University Press, Cambridge, 2004)




[40] B.R. Frieden, Extreme physical information as a principle of universal
stability, in Information Theory and Statistical Learning, ed. by F.
Emmert-Streib, M. Dehmer (Springer, Boston, 2009)

[41] B.R. Frieden, B.H. Soffer, Lagrangians of physics and the game of
Fisher-information transfer. Phys. Rev. E 52, 2274-2286 (1995)

[42] F.A. Gers, J. Schmidhuber, LSTM recurrent networks learn simple context
free and context sensitive languages. IEEE Trans. Neural Netw.
12, 1333-1340 (2001)

[43] X. Glorot, Y. Bengio, Understanding the difficulty of training deep
feedforward neural networks, in AISTATS'2010 (2010)

[44] X. Glorot, Y. Bengio, Understanding the difficulty of training deep
feedforward neural networks, in Proceedings of the 13th International
Conference on Artificial Intelligence and Statistics 2010, Chia Laguna
resort, Sardinia, Italy. JMLR, vol. 9 (2010)

[45] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks,
in Proceedings of the 14th International Conference on Artificial Intelligence
and Statistics 2011, Fort Lauderdale, FL, USA (2011)

[46] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press,
Cambridge, 2016), http://www.deeplearningbook.org

[47] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, in
NIPS (2014)

[48] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio,
Maxout networks, in ICML'13, ed. by S. Dasgupta, D. McAllester
(2013), pp. 1319-1327

[49] L. Goodman, On the exact variance of products. J. Am. Stat. Assoc.
55(292), 708-713 (1960). https://doi.org/10.2307/2281592, JSTOR
2281592

[50] A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep
recurrent neural networks, in ICASSP (2013), pp. 6645-6649

[51] P.R. Halmos, Measure Theory. The University Series in Higher Mathematics,
7th edn. (Van Nostrand Company, Princeton, 1961)

[52] B. Hanin, Universal function approximation by deep neural nets with
bounded width and relu activations (2017), arXiv:1708.02691




[53] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical
Learning, 2nd edn. (Springer, New York, 2017)

[54] D. Hilbert, Grundzüge einer allgemeinen theorie der linearen
integralgleichungen i. Gött. Nachrichten, math.-phys. Kl. (1904), pp. 49-91

[55] S. Hochreiter, Untersuchungen zu dynamischen neuronalen netzen,
Diploma thesis, Technische Universität München, 1991

[56] S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput.
9, 1735-1780 (1997)

[57] J. Hopfield, Neural networks and physical systems with emergent collective
computational abilities. Proc. Natl. Acad. Sci. 79, 2554-2558
(1982)

[58] J.J. Hopfield, Neurons with graded response have collective computational
properties like those of two-state neurons. Proc. Natl. Acad. Sci.
81, 3088-3092 (1984)

[59] K. Hornik, M. Stinchcombe, H. White, Multilayer feed-forward networks
are universal approximators. Neural Netw. 2, 359-366 (1989)

[60] B. Irie, S. Miyake, Capabilities of three-layered perceptrons, in IEEE
International Conference on Neural Networks, vol. 1 (1988), pp. 641-648

[61] E. Ising, Beitrag zur theorie des ferromagnetismus. Z. für Phys. 31, 253
(1925)

[62] H.J. Kelley, Gradient theory of optimal flight paths. ARS J. 30(10),
947-954 (1960). https://doi.org/10.2514/8.5282

[63] S. Kirkpatrick, C. Gelatt, M. Vecchi, Optimization by simulated annealing.
Science 220, 671-680 (1983)

[64] A.N. Kolmogorov, On the representation of continuous functions of
many variables by superposition of continuous functions in one variable
and addition. Dokl. Akad. Nauk. SSSR 144, 679-681 (1957). American
Mathematical Society Translation, 28, 55-59 (1963)

[65] R. Kondor, Group Theoretical Methods in Machine Learning (Columbia
University, New York, 2008)

[66] R. Kondor, S. Trivedi, On the generalization of equivariance and convolution
in neural networks to the action of compact groups (2018),
https://arxiv.org/abs/1802.03690




[67] B. Kosko, Bidirectional associative memories. IEEE Trans. Syst. Man 
Cybern. 18, 49-60 (1988) 

[68] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with
deep convolutional neural networks, in NIPS'2012 (2012)

[69] S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. 
Stat. 22, 79 (1951) 

[70] S. Kullback, R.A. Leibler, Information Theory and Statistics (Wiley, 
New York, 1959) 

[71] S. Kullback, R.A. Leibler, Letter to the editor: the Kullback-Leibler 
distance. Am. Stat. 41(4), (1987) 

[72] L.D. Landau, E.M. Lifshitz, Statistical Physics. Course of Theoretical
Physics, vol. 5, translated by J.B. Sykes, M.J. Kearsley (Pergamon
Press, Oxford, 1980)

[73] Y. LeCun, Modèles connexionnistes de l'apprentissage, Ph.D. thesis,
Université de Paris VI, 1987

[74] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning
applied to document recognition, in Proceedings of the IEEE, November
1998

[75] Y. LeCun, K. Kavukcuoglu, C. Farabet, Convolutional networks and
applications in vision, in Proceedings of 2010 IEEE International Symposium
Circuits and Systems (ISCAS), pp. 253-256

[76] Y. Li, K. Swersky, R.S. Zemel, Generative moment matching networks. 
CoRR (2015), arXiv:abs/1502.02761 

[77] S. Linnainmaa, The representation of the cumulative rounding error 
of an algorithm as a Taylor expansion of the local rounding errors, 
Master’s Thesis (in Finnish), University of Helsinki, pp. 6-7, 1970 

[78] S. Linnainmaa, Taylor expansion of the accumulated rounding error.
BIT Numer. Math. 16(2), 146-160 (1976).
https://doi.org/10.1007/bf01931367

[79] Z. Lu, H. Pu, F. Wang, Z. Hu, L. Wang, The expressive power of neural
networks: a view from the width, in Neural Information Processing
Systems (2017), pp. 6231-6239




[80] M.E. Hoff Jr., Learning phenomena in networks of adaptive circuits. 
Ph.D. thesis, Tech Rep. 1554-1, Stanford Electron. Labs., Standford, 
CA, July 1962 

[81] D.J.C. MacKay, Information Theory, Inference, and Learning Algo- 
rithms (Cambridge University Press, Cambridge, 2003) 

[82] W.S. McCulloch, W. Pitts, A logical calculus of idea immanent in ner- 
vous activity. Bull. Math. Biophys. 5, 115-133 (1943) 

[83] J. Mercer, Functions of positive and negative type and their connection 
with the theory of integral equations. Philos. Trans. R. Soc. Lond. Ser. 
A 209 , 415-446 (1909) 

[84] T. Mikolov, Statistical language models based on neural networks. 
Ph.D. thesis, Brno University of Technology, 2012 

[85] R.S. Millman, G.D. Parker, Elements of Differential Geometry 
(Prentice-Hall, Englewoods Cliffs, 1977) 

[86] M. Minsky, Neural nets and the brain: model problem. Dissertation, 
Princeton University, Princeton, 1954 

[87] M.L. Minsky, S.A. Papert, Perceptrons (MIT Press, Cambridge, 1969) 

[88] M. Mohri, A. Rostamizadeh, A. Talwalkar, Foundations of Machine 
Learning, 2nd edn. (MIT Press, Boston, 2018)

[89] E.H. Moore, On the reciprocal of the general algebraic matrix. Bull. 
Am. Math. Soc. 26(9), 394-95 (1920). https://doi.org/10.1090/S0002-9904-1920-03322-7

[90] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltz- 
mann machines, in Proceedings of the 27th International Conference on 
Machine Learning 2010 (2010) 

[91] Y.A. Nesterov, A method of solving a convex programming problem 
with convergence rate O(1/k^2). Sov. Math. Dokl. 27, 372-376 (1983)

[92] M. Nielsen, Neural Networks and Deep Learning (2017),
http://www.neuralnetworksanddeeplearning.com

[93] D.B. Parker, Learning-Logic (MIT, Cambridge, 1985) 

[94] E. Parzen, On the estimation of a probability density function and its 
mode. Ann. Math. Stat. 32, 1065-1076 (1962)




[95] R. Pascanu, Ç. Gülçehre, K. Cho, Y. Bengio, How to construct deep
recurrent neural networks, in ICLR (2014) 

[96] R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training
recurrent neural networks, in ICML. Neural Computation (2013)

[97] R. Penrose, A generalized inverse for matrices. Proc. Camb. Philos. Soc. 
51(3), 406-413 (1955). https://doi.org/10.1017/S0305004100030401 

[98] B.T. Polyak, Some methods of speeding up the convergence of iteration 
methods. USSR Comput. Math. Math. Phys. 4(5), 1-17 (1964) 

[99] M. Rattray, D. Saad, S. Amari, Natural gradient descent for on-line
learning. Phys. Rev. Lett. 81, 5461-5465 (1998) 

[100] S. Ravanbakhsh, J. Schneider, B. Poczos, Equivariance through
parameter-sharing, in Proceedings of International Conference on
Machine Learning (ICML) (2016), https://arxiv.org/pdf/1702.08389.pdf

[101] R. Rojas, Neural Networks: A Systematic Introduction (Springer, Berlin,
1996) 

[102] F. Rosenblatt, The perceptron: a probabilistic model for information
storage and organization in the brain. Psychol. Rev. 65, 386-408
(1958). Reprinted in: [Anderson and Rosenfeld 1988] 

[103] N. Le Roux, Y. Bengio, Deep belief networks are compact universal 
approximators. Neural Comput. 22(8), 2192-2207 (2010) 

[104] H.L. Royden, Real Analysis, 6th edn. (The Macmillan Company, New
York, 1966) 

[105] H.L. Royden, P.M. Fitzpatrick, Real Analysis (Prentice Hall, 2010)

[106] W. Rudin, Functional Analysis, International Series in Pure and
Applied Mathematics (McGraw-Hill, New York, 1991)

[107] D. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal
representations, in Parallel Distributed Processing: Explorations in the
Microstructure of Cognition, Foundations (MIT Press, Cambridge,
1986) 

[108] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representa- 
tions by back-propagating errors. Nature 323(6088), 533-536 (1986). 
https://doi.org/10.1038/323533a0. Bibcode:1986Natur.323..533R 




[109] J. Schmidhuber, Deep learning in neural networks: an overview. J. 
Math. Anal. Appl. (2014), https://arxiv.org/pdf/1404.7828.pdf

[110] S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: 
From Theory to Algorithms (Cambridge University Press, Cambridge, 
2014) 

[111] C. Shannon, A mathematical theory of communication. Bell Syst. Tech.
J. 27, 379-423, 623-656 (1948)

[112] K. Sharp, F. Matschinsky, Translation of Ludwig Boltzmann’s paper
“on the relationship between the second fundamental theorem of the
mechanical theory of heat and probability calculations regarding the
conditions for thermal equilibrium”. Entropy 17, 1971-2009 (2015).
https://doi.org/10.3390/e17041971

[113] Xingjian Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.K. Wong, W.C. Woo, 
Convolutional LSTM network: a machine learning approach for precipitation
nowcasting, in Proceedings of the 28th International Conference
on Neural Information Processing Systems (2015), pp. 802-810 

[114] P. Smolensky, Information processing in dynamical systems: foundations
of harmony theory, in Parallel Distributed Processing, vol. 1, ed.
by D.E. Rumelhart, J.L. McClelland (MIT Press, Cambridge, 1986), 
pp. 194-281 

[115] D. Sprecher, On the structure of continuous functions of several
variables. Trans. Am. Math. Soc. 115, 340-355 (1964)

[116] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,
Dropout: a simple way to prevent neural networks from overfitting.
J. Mach. Learn. Res. 15, 1929-1958 (2014) 

[117] K. Steinbuch, Automat und Mensch: Kybernetische Tatsachen und 
Hypothesen (Springer, Berlin, 1965) 

[118] T. Tieleman, G. Hinton, Lecture 6.5—rmsprop, coursera: neural
networks for machine learning. Technical report (2012)

[119] N. Tishby, F.C. Pereira, W. Bialek, The information bottleneck 
method, in The 37th Annual Allerton Conference on Communication,
Control, and Computing (1999), pp. 368-377

[120] D. Wackerly, W. Mendenhall, R. Scheaffer, Mathematical Statistics with
Applications, 7th edn. (Brooks/Cole Cengage Learning, 2008)




[121] D. Wagenaar, Information Geometry for Neural Networks (Centre for 
Neural Networks; King’s College London, 1998) 

[122] S. Wang, X. Sun, Generalization of hinging hyperplanes. IEEE Trans. 
Inf. Theory 51(12), 4425-4431 (2005) 

[123] P.J. Werbos, Beyond regression: new tools for prediction and analysis
in the behavioral sciences, Harvard University, 1975

[124] P.J. Werbos, Applications of advances in nonlinear sensitivity analysis, 
in Proceedings of the 10th IFIP Conference, 31.8-4.9, NYC (1981), pp.
762-770 

[125] B. Widrow, An adaptive “adaline” neuron using chemical “memistors”.
Technical Report 1553-2 (Office of Naval Research Contract, October
1960) 

[126] B. Widrow, Generalization and information storage in networks of ada- 
line neurons, in Self-Organizing Systems , ed. by M. Yovitz, G. Jacobi, 
G. Goldstein (Spartan Books, Washington, 1962), pp. 435-461 

[127] B. Widrow, M.A. Lehr, 30 years of adaptive neural networks:
perceptron, madaline, and backpropagation. Proc. IEEE 78(9), 1415-1442
(1990) 

[128] N. Wiener, Tauberian theorems. Ann. Math. 33(1), 1-100 (1932) 

[129] H.R. Wilson, J.D. Cowan, Excitatory and inhibitory interactions in 
localized populations of model neurons. Biophys. J. 12, 115-143 (1972)

[130] H.H. Yang, S. Amari, Complexity issues in natural gradient descent 
method for training multilayer perceptrons. Neural Comput. 10,
2137-2157 (1998)

[131] R.W. Yeung, First Course in Information Theory (Kluwer, Dordrecht,
2002)

[132] Y.T. Zhou, R. Chellappa, Computation of optical flow using a neural 
network, in IEEE 1988 International Conference on Neural Networks 
(1988), pp. 71-78 



Index 


Symbols 

F-separable, 574
ΣΠ-class network, 260
Σ-class network, 260
σ-finite measure, 701
d-dense, 253
d-system, 697
d^-convergence, 323
π-system, 697
ε-close neural nets, 220

A 

absolutely continuous, 703 
absolutely convergent, 304 
abstract neuron, 133 
action, 530 

activation function, 21, 217, 342 

AdaGrad, 100 

Adaline, 157 

Adam, 103 

AdaMax, 104 

adapted, 713 

adaptive implementation, 493 
algebra, 209 
almost periodic, 224 
analytic function, 260, 300 
AND, 136 

approximation sequence, 176 
approximation space, 253 
arctangent function, 29 
area, 433 


Arzela-Ascoli Theorem, 203, 205, 207, 
208, 726 

ascending sequence, 272 
autocorrelation, 154 
autoencoder, 360 
average 

of the function, 109 
average-pooling, 507 
axon, 9 

B 

backpropagation, 173, 177, 180, 237, 
483, 550 

Baire measure, 257, 700 
Banach space, 721 
basin of attraction, 78, 126 
batch, 188 
Bernoulli 

distribution, 710 
random variable, 459 
Bessel inequality, 303, 724 
bias, 5 

binary classifier, 148 
bipolar step function, 22 
body-cell, 9 
Bohr condition, 224 
Boltzmann 

constant, 107 
distribution, 109, 614 
learning, 621 
machine, 611, 617 
probability, 107 



© Springer Nature Switzerland AG 2020 
O. Calin, Deep Learning Architectures, Springer Series in the Data Sciences, 
https://doi.org/10.1007/978-3-030-36721-3 




Boole’s inequality, 700 
Borel

σ-algebra, 36
σ-field, 254, 268
measurable, 268 
measurable functions, 254 
measure, 700, 705, 727 
sets, 565 

Borel-Cantelli Lemma, 715 
bottleneck, 393 
bounded 

functional, 33 
linear functional, 255 
Bounded Convergence Theorem, 

229, 702 
Brownian 

motion, 118, 719 
build-in frequencies, 212 
bump 

function, 229 

bumped-shaped function, 30 

C 

Cantor’s lemma, 202, 692 
capacity, 381 

Cauchy sequence, 215, 727 
Cauchy’s inequality, 205, 216, 307 
Central Limit Theorem, 188, 453, 717 
chain rule, 173 
characteristic function, 716 
Cholesky decomposition, 119 
Christoffel symbols, 424, 449, 627 
CIFAR data set, 63 
circular search, 122 
classification functions, 31 
cluster 

classification, 140 
commutator, 426 
compact, 692 
interval, 207 


metric space, 202 
set, 239, 253 
sets, 202 

support signal, 518-520 
complete, 727 

compressed information, 394 
compressibility conditions, 342, 344 
compression factor, 380 
compressionless layers, 376 
computerized tomography, 300 
conditional 

entropy, 363 
expectation, 44, 439 
expectations, 714 
model density function, 48 
conditional entropy, 378, 548 
contains constants, 209, 260 
continuum 

input neuron, 159 
number of neurons, 300 
contraction, 577, 728 
contraction principle, 215, 544 
convergence 

almost sure, 715 
in L^p, 715
in distribution, 324 
in mean square, 716 
in probability, 271, 715 
convergence conditions, 98 
convex 

functional, 723 
hull, 580 
separability, 575 
set, 571 

convolution, 229, 518 
kernel, 242 
operator, 521 
series, 98 

cooling process, 105 
cosine, 211 
cosine squasher, 36 




cost function, 41, 49, 173, 185 
counting measure, 699 
covariance, 711 
covariant derivative, 425 
Cramer-Rao inequality, 470 
cross-correlation, 518, 519 
cross-entropy, 46, 146, 185, 609 
error, 147 

crystalline structure, 105 
current, 5 

curvature tensor, 627 

D 

damping coefficient, 94 
decision 

boundary curve, 587 
functions, 565 
line, 150 
map, 561, 576 
maps, 585 
decoder, 395 
deep learning, 215 
deep neural network, 300 
default probability, 145 
degenerate kernel, 306 
deitas, 181 
dendrites, 9 
dense, 256 
dense set, 208 
dependence tree, 175 
determinant, 342 
diagonalization procedure, 273 
differential entropy, 353 
diffusion matrix, 118 
Dini’s Theorem, 201, 245 
Dirac’s 

function, 22 

measure, 160, 228, 699, 701 
directional derivative, 81 
discrete entropy, 353 
discrete measure, 160, 699 
discriminatory 


function, 258 
in L^1-sense, 266
in L^2-sense, 262
discriminatory function, 32 
dispersion, 118 
distribution, 709 
distribution measure, 323 
divergence, 92, 94 

Dominated Convergence Theorem, 702 

double exponential, 30, 213 

drift, 118 

dropout, 454 

Dynkin’s

formula, 118, 719 
Theorem, 697 

E 

Egorov’s Theorems, 704 
eigenfunction, 303 
eigenvalue, 303, 734 
eigenvalues, 113 
eigenvectors, 113 
elementwise product, 183
ELU, 24 
empirical 

mean, 152 
probability, 146 
encoder, 395 
energy function, 618 
entropy, 195, 198, 351, 615 
binary, 411 
change, 355 
flow, 356 

of a partition, 563 
epoch, 177 
equicontinuous, 204 
equilibrium point, 91 
equivalence 
classes, 562 
relations, 561 

equivariance, 521, 533, 537 




error function, 41 
error information, 319 
estimator, 46 
Euclidean 
ball, 272 

distance, 43, 215, 727 
gradient, 493 
inner product, 75
length, 441 
structure, 450 
Euclidean distance, 254 
Euclidean gradient, 488 
exact learning, 285, 300, 439, 620 
exact solution, 153 
expectation, 710 
exploding gradient, 554 
extended 

input, 134 
weights, 134 
extrinsic, 441 

F 

features, 377 

feedforward neural network, 

178, 281 

Fermat’s theorem, 69
filter, 518 
filtration, 327, 713 
finite 

energy signal, 517, 519, 520 
signal, 517, 519 
finite measure, 701 
first fundamental form, 424, 441 
first layer, 168 
Fisher 

information, 626 
Fisher information, 465, 470 
matrix, 471 

Fisher metric, 472, 499, 612 
fixed point, 216, 219, 699 
theorem, 727 
flattest manifold, 445 


forward 

pass, 179 

propagation formula, 182 
Fourier transform, 34, 214, 264, 268, 
298, 716 

functional independence, 343 

G 

gates, 554 

Gaussian, 30, 214, 300 
noise, 460 

generalization error, 60 
generalized derivative, 238 
generative moment matching networks, 
606 

geodesic, 442 

submanifold, 446 
global minimum, 69 
Goodman’s formula, 193, 711 
Google Street View images, 63 
GPU systems, 177
gradient ascent, 613 
gradient clipping, 554 
gradient descent, 43, 151, 154, 173, 
186 

algorithm, 73 
method, 82, 147 
gradient estimation, 156 
group, 531 

H 

Haar measure, 535 
Hadamard product, 184, 185, 459 
Hahn Decomposition Theorem, 706 
Hahn’s function, 29 
Hahn-Banach Theorem, 256, 261, 723 
half-space, 33 

Hamiltonian function, 91, 92 
harmonic function, 73 
Heaviside function, 22, 227, 230 
hedging application, 241 
Heisenberg group, 532 




Helly’s theorem, 37
Hellinger distance, 55 
Helly’s theorem, 726 
Hessian 

matrix, 71, 113, 115 
method, 113 
hidden 

layer, 167, 178, 205 
hidden layer, 334 
Hilbert 

space, 44, 45, 723, 724 
Hilbert’s thirteenth problem, 296 
hockey-stick function, 23 
homeomorphism, 574 
homogeneous space, 531 
Hopfield network, 611, 629
horizontal asymptotes, 32 
hyperbolic tangent, 28, 181, 344 
hyperparameter, 59, 93, 193, 212 
hyperplane, 33 
hypersurface, 43 

I 

identity function, 21 
image measure, 703 
incompressible, 93 
indefinite integral, 703 
independent layer, 375 
indicator function, 33, 228, 

320, 698 

infinitesimal operator, 118
information 

assessment, 351 
bottleneck, 393 
compression, 355 
compressor, 381 
field, 329 
loss, 356 
path, 399 
plane, 398 

information geometry, 466 


initialization, 176 
inner product, 44
input, 41, 321 
entropy, 49 
information, 325 
variable, 48 
input information, 318 
input-output 

function, 360, 578 
input-output mapping, 171, 218 
integral kernel, 304 
integral transform, 207, 225 
intersection, 691 
intrinsic, 441 
invariance, 537 
invariance property, 366 
Inverse Function Theorem, 78, 616, 
729 

Irie and Miyake’s integral formula, 297 
iterative formula, 155 
Ito process, 118 
Ito’s 

diffusion, 719 
formula, 719 

J 

Jacobian, 78, 344, 354, 360, 548, 616 
Jeffrey distance, 55 
Jensen-Shannon divergence, 51 
joint density, 48 
joint entropy, 353 

Jordan Measure Decomposition, 706 
jump discontinuity, 143 

K 

kernel, 207, 518, 520 
kinetic energy, 90 
Kirchhoff law, 6 
Kolmogorov’s Theorem, 296 
Kullback-Leibler divergence, 48, 50, 
365, 397, 468, 550, 592, 620 




L 

Lagrange 

multiplier, 58 
Langevin’s equation, 120 
Laplace potential, 214 
Laplacian, 73 

Law of Large Numbers, 716 

layer, 178 

learning 

continuous functions, 255
decision maps, 578
finite support functions, 285
integrable functions, 266
measurable functions, 268
solutions of ODEs, 280
square integrable functions, 261
with information, 438
with ReLU, 237
with Softplus, 242
learning rate, 152, 155, 189 
least mean squares, 152 
Lebesgue measure, 161, 207, 214, 
261, 352, 354, 564, 566, 
700-702 
left search, 121 
level 

curve, 105 
hypersurfaces, 73 
sets, 73 

level curves, 63 
Levi-Civita connection, 426 
line search method, 86 
linear 

function, 21 
functional, 256 
transformation, 169 
linear connection, 426 
linear neuron, 21, 152, 171, 184, 

333, 360, 476, 574 
linear operator, 722 


linearly independent, 569 
Lipschitz continuous, 75 
local minimum, 71 
log-likelihood, 352, 359 
log-likelihood function, 149 
logistic 

function, 27, 143, 181, 185, 216, 
229, 265, 300 
regression, 145 
logit, 28 

loss function, 41, 549 
lost information, 325 
LSTM, 554 
Luzin’s Theorem, 704 

M 

Madaline, 158 
manifold, 417 
Markov 

chain, 368 
inequality, 716 
property, 369 
master equations, 183
matrix form, 183 
max pooling, 507 
maximal 

element, 327 
rank, 345 

maximum likelihood, 50, 150 
maximum mean discrepancy, 52 
McCulloch-Pitts neuron, 38 
mean squared error, 47 
Mean Value Theorem, 204, 544 
meaningful information, 394 
measurable 

function, 699 
set, 320, 328 
space, 254, 697 
measure, 33, 34, 254 
Mercer’s Theorem, 308, 311 
metric 

space, 202, 254, 269, 721, 727 




structure, 439 
min pooling, 508 
mini-batch, 156 
minimal submanifold, 429 
MNIST data set, 62, 400, 437, 514 
model 

averaging, 453 
combination, 454 
moment matching, 606 
momentum method, 93 
Monotone Convergence Theorem, 702 
Moore-Penrose pseudoinverse, 389, 
450, 736 

moving average, 518 
multiplicative noise, 459 
mutual information, 364, 365 

N 

natural gradient, 487 
negligible, 701 
neighborhood search, 121 
Nesterov Accelerated Gradient, 100 
neural network, 42, 167 
neural networks, 167 
neuromanifold, 474, 498 
neuron, 9, 167 
Newton’s method, 115 
Newton-Raphson method, 116 
noise removal, 173 
noiseless coding theorem, 378 
noisy neuron, 466 
nondegenerate, 154 
nonnegative definite, 308 
non-recoverable information, 

327, 328 
norm, 59, 721 

of an operator, 722 
regularization, 445 
normal distribution, 193, 710 
normal equation, 43 


normalized initialization, 195 
normed linear space, 255 
normed space, 721 
null set, 705 

O 

odds to favor, 144 
Ohm’s law, 6 

one-hidden layer neural net, 198, 207, 
217, 239, 259, 311 
one-hot-vector, 31, 401, 567, 576 
Online learning, 156 
optimal 

parameter value, 443 
point, 442 
OR, 136 
orbit, 531 

Ornstein-Uhlenbeck, 62, 120 
orthogonal, 261 
projection, 436 
orthonormal projection, 43 
orthonormal system, 724
output, 41, 178 

information, 318, 325, 329 
output information, 438 
overdetermined system, 737
overfit, 381

overfitting, 59, 417, 454
removal, 64 

P 

parameters, 42 
partition, 563 
partition function, 616 
Parzen window, 55 
perceptron, 136, 171, 227, 254, 325, 
565 

network, 237 
capacity, 390 
learning algorithm, 163 
model, 135 

periodic function, 211, 223, 224 




Picard-Lindelof theorem, 76 
piecewise linear, 30 
plateau, 116, 187 
pointwise convergence, 201 
polygonal line, 87 
polynomial regression, 448 
pooling, 355, 507 
layer, 513 

positive definite, 154 
potential 

energy, 90 
function, 59 
predictor, 44 
PReLU, 23 
pressure, 4 

primitive function, 134 
prior, 195 
probability 

measure, 701 
space, 709 

probability measure, 161 
proximity, 41 

Q 

quadratic

error function, 153
form, 72
function, 72

quadratic Renyi entropy, 55
quotient set, 562 

R 

Radon-Nikodym theorem, 

161, 703 
rank, 88, 128 

recoverable information, 329 
Rectification Theorem, 587 
recurrent neural net, 217 
regular cross-entropy function,
185

regularization, 58, 205, 216, 443 
regularized cost function, 63 


ReLU, 23, 187, 215, 239 
Renyi entropy, 55 
resistor, 5 

restricted Boltzmann machines, 635 
Riemannian 

distance, 424 
manifold, 424, 468 
metric, 423, 440, 450, 472 
Riesz Theorem, 725 
right search, 121 
RMSProp, 101 
RNN, 543, 546 
rooks problem, 7, 630 
rotations group, 531 

S 

sample estimation, 55 

second fundamental form, 427, 445 

SELU, 24 

separability, 570 

separate points, 209, 260 

sequential continuity, 700 

Shannon 

entropy, 48-50 
sigmoid 

activation function, 232 
function, 26, 204 
neural network, 236 
neuron, 143, 206, 224, 333, 450 
sigmoidal function, 32, 34, 259, 272 
signal, 168 
signed measures, 705 
simple function, 698 
simplex algorithm, 73 
simulated annealing method, 104, 499,
613, 629

sine, 211 
SLU, 24 

softmax, 31, 389, 582 
softplus function, 25, 242 
softsign, 29 




spectral radius, 734 
Sprecher’s Theorem, 297 
square integrable functions, 45 
squared error approximation, 277 
squashing function, 36, 286 
state system, 543 
statistical manifold, 466 
steepest descent, 79, 82 
step function, 21, 36 
stochastic 

gradient descent, 189 
process, 713 
search, 116 
spherical search, 124 
stochastic neuron, 611
stocks, 212 

Stone-Weierstrass Theorem, 208, 
260, 306 
stride, 522 
sub-manifold, 426 
sum of squared errors, 254 
supervised learning, 69 
SVHN data set, 63 
symmetric difference, 328 
synapse, 9 

T 

tangent line, 70, 116 

tangent plane, 437 

tangent vector field, 423 

target function, 41, 176, 218, 251 

target information, 318 

taxi-cab distance, 254 

Taylor approximation, 70, 114 

temperature, 105 

tensor, 693 

test 

error, 59 
thermodynamic 
system, 107 

thermodynamical system, 618 


threshold step function, 22 
Toeplitz 

matrix, 385 
property, 387 
total compression, 380 
total energy, 91 
totally geodesic, 429 
training 

data, 58 

distribution, 48, 466 
events, 320
measure, 320 
set, 60 
transistor, 6 
transition function, 544 
translation 

invariance, 510 
operator, 521 
transpose, 733 
triangle inequality, 218 

U 

uncompressed layer, 340, 342 
underfit, 66, 378, 381
underlying distribution, 65 
uniform convergence, 303 
uniform convergent sequence, 203 
uniform distribution, 198 
uniformly bounded, 203, 205 
uniformly continuous, 230 
union, 691 

univariate gaussian, 157 
universal approximators, 251 

V 

validation set, 60, 193 
vanishing gradient, 187, 545, 554 
variance, 191, 711 

approximation, 712 
voltage, 5 




W 

weak convergence, 324, 716 
weights, 178 

initialization, 191 
Wiener’s Tauberian Theorem, 
213, 214, 268 


X 

Xavier initialization, 193 
XOR, 138, 171, 286 

Z 

Zorn’s Lemma, 327, 692 


